Biclustering
Biclustering, block clustering,[1][2] Co-clustering or two-mode clustering[3][4][5] is a data mining technique which allows simultaneous clustering of the rows and columns of a matrix. The term was first introduced by Boris Mirkin[6] to name a technique introduced many years earlier,[6] in 1972, by John A. Hartigan.[7]
Given a set of samples represented by an -dimensional feature vector, the entire dataset can be represented as rows in columns (i.e., an matrix). The Biclustering algorithm generates Biclusters. A Bicluster is a subset of rows which exhibit similar behavior across a subset of columns, or vice versa.
Development
Biclustering was originally introduced by
In 2001 and 2003, I. S. Dhillon published two algorithms applying biclustering to files and words. One version was based on bipartite spectral graph partitioning.
To cluster more than two types of objects, in 2005, Bekkerman expanded the mutual information in Dhillon's theorem from a single pair into multiple pairs.[12]
Complexity
The complexity of the Biclustering problem depends on the exact problem formulation, and particularly on the merit function used to evaluate the quality of a given Bicluster. However, the most interesting variants of this problem are
Types of Biclusters
Bicluster with constant values (a)
When a Biclustering algorithm tries to find a constant-value Bicluster, it reorders the rows and columns of the matrix to group together similar rows and columns, eventually grouping Biclusters with similar values. This method is sufficient when the data is normalized. A perfect constant Bicluster is a matrix(I,J) in which all values a(i,j) are equal to a given constant μ. In tangible data, these entries a(i,j) may be represented with the form n(i,j) + μ where n(i,j) denotes the noise. According to Hartigan's algorithm, by splitting the original data matrix into a set of Biclusters, variance is used to compute constant Biclusters. Hence, a perfect Bicluster may be equivalently defined as a matrix with a variance of zero. In order to prevent the partitioning of the data matrix into Biclusters with the only one row and one column; Hartigan assumes that there are, for example, K Biclusters within the data matrix. When the data matrix is partitioned into K Biclusters, the algorithm ends.
Bicluster with constant values on rows (b) or columns (c)
Unlike the constant-value Biclusters, these types of Biclusters cannot be evaluated solely based on the variance of their values. To finish the identification, the columns and the rows should be normalized first. There are, however, other algorithms, without the normalization step, that can find Biclusters which have rows and columns with different approaches.
Bicluster with coherent values (d, e)
For Biclusters with coherent values on rows and columns, an overall improvement over the algorithms for Biclusters with constant values on rows or on columns should be considered. This algorithm may contain analysis of variance between groups, using co-variance between both rows and columns. In Cheng and Church's theorem, a Bicluster is defined as a subset of rows and columns with almost the same score. The similarity score is used to measure the coherence of rows and columns.
|
|
|
|
|
The relationship between these cluster models and other types of clustering such as correlation clustering is discussed in.[15]
Algorithms
There are many Biclustering
Biclustering algorithms have also been proposed and used in other application fields under the names co-clustering, bi-dimensional clustering, and subspace clustering.[14]
Given the known importance of discovering local patterns in
These
Some recent algorithms have attempted to include additional support for Biclustering rectangular matrices in the form of other
There is an ongoing debate about how to judge the results of these methods, as Biclustering allows overlap between clusters and some
Text clustering can solve the high-dimensional sparse problem, which means clustering text and words at the same time. When clustering text, we need to think about not only the words information, but also the information of words clusters that was composed by words. Then, according to similarity of feature words in the text, will eventually cluster the feature words. This is called co-clustering. There are two advantages of co-clustering: one is clustering the test based on words clusters can extremely decrease the dimension of clustering, it can also appropriate to measure the distance between the tests. Second is mining more useful information and can get the corresponding information in test clusters and words clusters. This corresponding information can be used to describe the type of texts and words, at the same time, the result of words clustering can be also used to text mining and information retrieval.
Several approaches have been proposed based on the information contents of the resulting blocks: matrix-based approaches such as
More recently (Bisson and Hussain)
In text databases, for a document collection defined by a document by term D matrix (of size m by n, m: number of documents, n: number of terms) the cover-coefficient based clustering methodology[27] yields the same number of clusters both for documents and terms (words) using a double-stage probability experiment. According to the cover coefficient concept number of clusters can also be roughly estimated by the following formula where t is the number of non-zero entries in D. Note that in D each row and each column must contain at least one non-zero element.
In contrast to other approaches, FABIA is a multiplicative model that assumes realistic non-Gaussian signal distributions with heavy tails. FABIA utilizes well understood model selection techniques like variational approaches and applies the Bayesian framework. The generative framework allows FABIA to determine the information content of each Bicluster to separate spurious Biclusters from true Biclusters.
See also
- Formal concept analysis
- Biclique
- Galois connection
References
- ^ G. Govaert; M. Nadif (2008). "Block clustering with bernoulli mixture models: Comparison of different approaches". Computational Statistics and Data Analysis. 52 (6): 3233–3245. .
- ^
R. Balamurugan; A.M. Natarajan; K. Premalatha (2015). "Stellar-Mass Black Hole Optimization for Biclustering Microarray Gene Expression Data". Applied Artificial Intelligence. 29 (4): 353–381. S2CID 44624424.
- ^
G. Govaert; M. Nadif (2013). Co-clustering: models, algorithms and applications. ISTE, Wiley. ISBN 978-1-84821-473-6.
- ^ R. Balamurugan; A.M. Natarajan; K. Premalatha (2016). "A Modified Harmony Search Method for Biclustering Microarray Gene Expression Data". International Journal of Data Mining and Bioinformatics. 16 (4): 269–289. .
- ^
Van Mechelen I, Bock HH, De Boeck P (2004). "Two-mode clustering methods:a structured overview". Statistical Methods in Medical Research. 13 (5): 363–94. S2CID 19058237.
- ^ ISBN 978-0-7923-4159-8.
- ^ JSTOR 2284710.
- ^ https://www.cs.princeton.edu/courses/archive/fall03/cs597F/Articles/biclustering_of_expression_data.pdf Cheng Y, Church G M. Biclustering of expression data[C]//Ismb. 2000, 8: 93–103.
- S2CID 11847258.
- S2CID 12286784.
- S2CID 2719002.
- S2CID 858524.
- S2CID 3102766.
- ^ a b
Madeira SC, Oliveira AL (2004). "Biclustering Algorithms for Biological Data Analysis: A Survey". IEEE/ACM Transactions on Computational Biology and Bioinformatics. 1 (1): 24–45. S2CID 206628783.
- S2CID 17363900.
- ^
Tanay A, Sharan R, Kupiec M, Shamir R (2004). "Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data". Proceedings of the National Academy of Sciences. 101 (9): 2981–2986. PMID 14973197.
- ^ Abdullah, Ahsan; Hussain, Amir (2006). "A new biclustering technique based on crossing minimization". Neurocomputing. 69 (16–18): 1882–1896. .
- ^
Reiss DJ, Baliga NS, Bonneau R (2006). "Integrated biclustering of heterogeneous genome-wide datasets for the inference of global regulatory networks". BMC Bioinformatics. 7: 280–302. PMID 16749936.
- ^
PMID 20418340.
- ^
Orzechowski P, Pańszczyk A, Huang X, Moore JH (2018). "runibic: a Bioconductor package for parallel row-based biclustering of gene expression data". Bioinformatics. 34 (24): 4302–4304. PMID 29939213.
- ^
Orzechowski P, Sipper M, Huang X, Moore JH (2018). "EBIC: an evolutionary-based parallel biclustering algorithm for pattern discovery". Bioinformatics. 34 (21): 3719–3726. PMID 29790909.
- ^
Fanaee-T H, Thoresen, M (2020). "Iterative Multi-mode Discretization: Applications to Co-clustering". Discovery Science. Lecture Notes in Computer Science. Vol. 12323. pp. 94–105. S2CID 222832035.
- ^
Madeira SC, Teixeira MC, Sá-Correia I, Oliveira AL (2010). "Identification of Regulatory Modules in Time Series Gene Expression Data using a Linear Time Biclustering Algorithm". IEEE/ACM Transactions on Computational Biology and Bioinformatics. 1 (7): 153–165. S2CID 7369531.
- ^
Madeira SC, Oliveira AL (2009). "A polynomial time biclustering algorithm for finding approximate expression patterns in gene expression time series". Algorithms for Molecular Biology. 4 (8): 8. PMID 19497096.
- ^
Aguilar-Ruiz JS (2005). "Shifting and scaling patterns from gene expression data". Bioinformatics. 21 (10): 3840–3845. PMID 16144809.
- ^ S2CID 15506600.
- ^
Can, F.; Ozkarahan, E. A. (1990). "Concepts and effectiveness of the cover coefficient based clustering methodology for text databases" (PDF). ACM Transactions on Database Systems. 15 (4): 483–517. S2CID 14309214.
Others
- N.K. Verma, S. Bajpai, A. Singh, A. Nagrare, S. Meena, Yan Cui, "A Comparison of Biclustering Algorithms" in International conference on Systems in Medicine and Biology (ICSMB 2010)in IIT Kharagpur India, pp. 90–97, Dec. 16–18.
- J. Gupta, S. Singh and N.K. Verma "MTBA: MATLAB Toolbox for Biclustering Analysis", IEEE Workshop on Computational Intelligence: Theories, Applications and Future Directions", IIT Kanpur India, pp. 148–152, Jul. 2013.
- A. Tanay. R. Sharan, and R. Shamir, "Biclustering Algorithms: A Survey", In Handbook of Computational Molecular Biology, Edited by Srinivas Aluru, Chapman (2004)
- Kluger Y, Basri R, Chang JT, Gerstein MB (2003). "Spectral Biclustering of Microarray Data: Coclustering Genes and Conditions". Genome Research. 13 (4): 703–716. PMID 12671006.
- Adetayo Kasim, Ziv Shkedy, Sebastian Kaiser, Sepp Hochreiter, Willem Talloen (2016), Applied Biclustering Methods for Big and High-Dimensional Data Using R, Chapman & Hall/CRC Press
- Orzechowski, P., Sipper, M., Huang, X., & Moore, J. H. (2018). EBIC: an evolutionary-based parallel biclustering algorithm for pattern discovery. Bioinformatics.