ELKI
Developer(s) | Technical University of Dortmund; initially Ludwig Maximilian University of Munich |
---|---|
Stable release | 0.8.0 / 5 October 2022 |
Repository | |
Written in | Java platform |
Type | Data mining |
License | AGPL (since version 0.4.0) |
Website | elki-project |
ELKI (Environment for Developing KDD-Applications Supported by Index-Structures) is a
Description
The ELKI framework is written in
ELKI has been used in
Objectives
The university project is developed for use in teaching and research. The source code is written with extensibility and reusability in mind, but is also optimized for performance. The experimental evaluation of algorithms depends on many environmental factors, and implementation details can have a large impact on the runtime.[7] ELKI aims to provide a shared codebase with comparable implementations of many algorithms.
As a research project, it currently does not offer integration with
Architecture
ELKI is modeled around a
ELKI makes extensive use of Java interfaces, so that it can be extended easily in many places. For example, custom data types, distance functions, index structures, algorithms, input parsers, and output modules can be added and combined without modifying the existing code. This includes the possibility of defining a custom distance function and using existing indexes for acceleration.
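As a rough illustration of this interface-driven design, the following sketch plugs two interchangeable distance functions into a generic nearest-neighbor routine without changing that routine. The names `Distance`, `ManhattanDistance`, `EuclideanDistance`, and `NearestNeighbor` are assumptions made for this example and are not ELKI's actual API.

```java
import java.util.List;

// Hypothetical interface for illustration only; not ELKI's actual API.
interface Distance {
  double distance(double[] a, double[] b);
}

// A custom distance, added without touching the search code below.
class ManhattanDistance implements Distance {
  @Override
  public double distance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      sum += Math.abs(a[i] - b[i]);
    }
    return sum;
  }
}

// A second implementation of the same interface.
class EuclideanDistance implements Distance {
  @Override
  public double distance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return Math.sqrt(sum);
  }
}

// Generic algorithm that depends only on the Distance interface.
class NearestNeighbor {
  // Linear scan; returns the index of the point closest to query,
  // or -1 if data is empty.
  static int nearest(List<double[]> data, double[] query, Distance dist) {
    int best = -1;
    double bestDist = Double.POSITIVE_INFINITY;
    for (int i = 0; i < data.size(); i++) {
      double d = dist.distance(data.get(i), query);
      if (d < bestDist) {
        bestDist = d;
        best = i;
      }
    }
    return best;
  }
}
```

Swapping the `Distance` argument changes the result of the search while `NearestNeighbor` itself stays untouched, which is the kind of combination the paragraph above describes.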
ELKI uses a
ELKI uses optimized collections for performance rather than the standard Java API.[8] For example, for loops are written similarly to C++ iterators:
for (DBIDIter iter = ids.iter(); iter.valid(); iter.advance()) {
  relation.get(iter); // E.g., get the referenced object
  idcollection.add(iter); // E.g., add the reference to a DBID collection
}
In contrast to typical Java iterators (which can only iterate over objects), this conserves memory, because the iterator can internally use primitive values for data storage. The reduced garbage collection improves the runtime. Optimized collections libraries such as GNU Trove3, Koloboke, and fastutil employ similar optimizations. ELKI includes data structures such as object collections and heaps (for, e.g., nearest neighbor search) using such optimizations.
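A minimal sketch of the idea, assuming a simplified ID collection backed by an `int[]`: the hypothetical `IntIdArray` and `IntIdIter` classes below (not ELKI's actual DBID classes) expose the same `valid()`/`advance()` loop shape, and the iterator hands out primitive `int` values, so no wrapper object is allocated per element.

```java
import java.util.Arrays;

// Hypothetical simplified classes for illustration; not ELKI's DBID API.
final class IntIdIter {
  private final int[] ids;
  private final int size;
  private int pos = 0;

  IntIdIter(int[] ids, int size) {
    this.ids = ids;
    this.size = size;
  }

  // True while the iterator points at an existing element.
  boolean valid() {
    return pos < size;
  }

  // Move to the next element; returns this to allow chaining.
  IntIdIter advance() {
    pos++;
    return this;
  }

  // Current ID as a primitive int (no boxing into Integer).
  int id() {
    return ids[pos];
  }
}

final class IntIdArray {
  private int[] ids = new int[8];
  private int size = 0;

  // Append an ID, doubling the backing array when it is full.
  void add(int id) {
    if (size == ids.length) {
      ids = Arrays.copyOf(ids, size * 2);
    }
    ids[size++] = id;
  }

  int size() {
    return size;
  }

  IntIdIter iter() {
    return new IntIdIter(ids, size);
  }
}

class PrimitiveIterDemo {
  // Sums all IDs using the valid()/advance() loop style.
  static long sumIds(IntIdArray array) {
    long sum = 0;
    for (IntIdIter it = array.iter(); it.valid(); it.advance()) {
      sum += it.id();
    }
    return sum;
  }
}
```

By comparison, a `java.util.Iterator<Integer>` over the same data has to return an `Integer` object from every `next()` call, which is the allocation overhead this loop style avoids.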
Visualization
The visualization module uses
Awards
Version 0.4, presented at the "Symposium on Spatial and Temporal Databases" 2011, which included various methods for spatial outlier detection,[9] won the conference's "best demonstration paper award".
Included algorithms
Select included algorithms:[10]
- Cluster analysis:
- K-means clustering (including fast algorithms such as Elkan, Hamerly, Annulus, and Exponion k-Means, and robust variants such as k-means--)
- K-medians clustering
- K-medoids clustering (PAM) (including FastPAM and approximations such as CLARA, CLARANS)
- Expectation-maximization algorithm for Gaussian mixture modeling
- Hierarchical clustering (including the fast SLINK, CLINK, NNChain and Anderberg algorithms)
- Single-linkage clustering
- Leader clustering
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise, with full index acceleration for arbitrary distance functions)
- OPTICS (Ordering Points To Identify the Clustering Structure), including the extensions OPTICS-OF, DeLi-Clu, HiSC, HiCO and DiSH
- HDBSCAN
- Mean-shift clustering
- BIRCH clustering
- SUBCLU (Density-Connected Subspace Clustering for High-Dimensional Data)
- CLIQUE clustering
- ORCLUS and PROCLUS clustering
- COPAC, ERiC and 4C clustering
- CASH clustering
- DOC and FastDOC subspace clustering
- P3C clustering
- Canopy clustering algorithm
- Anomaly detection:
- k-Nearest-Neighbor outlier detection
- LOF (Local Outlier Factor)
- LoOP (Local Outlier Probabilities)
- OPTICS-OF
- DB-Outlier (Distance-Based Outliers)
- LOCI (Local Correlation Integral)
- LDOF (Local Distance-Based Outlier Factor)
- EM-Outlier
- SOD (Subspace Outlier Degree)
- COP (Correlation Outlier Probabilities)
- Frequent Itemset Mining and association rule learning
- Apriori algorithm
- Eclat
- FP-growth
- Dimensionality reduction
- Spatial index structures and other search indexes:
- Evaluation:
- F1 score, Average Precision
- Receiver operating characteristic (ROC curve)
- Discounted cumulative gain (including NDCG)
- Silhouette index
- Davies–Bouldin index
- Dunn index
- Density-based cluster validation (DBCV)
- Visualization
- Scatter plots
- Histograms
- Parallel coordinates (also in 3D, using OpenGL)
- Other:
- Statistical distributions and many parameter estimators, including robust MAD based and L-moment based estimators
- Dynamic time warping
- Change point detection in time series
- Intrinsic dimensionality estimators
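To give a flavor of one of the simplest entries in this list, the sketch below computes a k-nearest-neighbor outlier score, taking the distance from each point to its k-th nearest neighbor as the score. It is a brute-force, plain-Java illustration assuming Euclidean distance; the class name `KnnOutlierSketch` is hypothetical, and the code does not use ELKI's classes or its index acceleration.

```java
import java.util.Arrays;

// Hypothetical illustration class; not part of ELKI.
class KnnOutlierSketch {
  static double euclidean(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return Math.sqrt(sum);
  }

  // Returns, for each point, the distance to its k-th nearest other point.
  // Larger scores suggest more isolated points. Requires 1 <= k < n.
  static double[] scores(double[][] data, int k) {
    int n = data.length;
    if (k < 1 || k >= n) {
      throw new IllegalArgumentException("k must satisfy 1 <= k < n");
    }
    double[] result = new double[n];
    double[] dists = new double[n - 1];
    for (int i = 0; i < n; i++) {
      int m = 0;
      for (int j = 0; j < n; j++) {
        if (j != i) {
          dists[m++] = euclidean(data[i], data[j]);
        }
      }
      Arrays.sort(dists); // O(n log n) per point; fine for a sketch
      result[i] = dists[k - 1];
    }
    return result;
  }
}
```

This version needs quadratic time in the number of points; avoiding that cost is the purpose of index structures for nearest-neighbor queries.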
Version history
Version 0.1 (July 2008) contained several algorithms from
Version 0.2 (July 2009) added functionality for
Version 0.3 (March 2010) extended the choice of anomaly detection algorithms and visualization modules.[13]
Version 0.4 (September 2011) added algorithms for geo data mining and support for multi-relational database and index structures.[9]
Version 0.5 (April 2012) focused on the evaluation of cluster analysis results, adding new visualizations and some new algorithms.[14]
Version 0.6 (June 2013) introduced a new 3D adaptation of parallel coordinates for data visualization, apart from the usual additions of algorithms and index structures.[15]
Version 0.7 (August 2015) added support for uncertain data types, and algorithms for the analysis of uncertain data.[16]
Version 0.7.5 (February 2019) added further clustering algorithms, anomaly detection algorithms, evaluation measures, and indexing structures.[17]
Version 0.8 (October 2022) added automatic index creation, garbage collection, and incremental priority search, as well as many more algorithms such as BIRCH.[18]
Similar applications
- scikit-learn: machine learning library in Python
- classification algorithms
- RapidMiner: An application available commercially (a restricted version is available as open source)
- KNIME: An open source platform which integrates various components for machine learning and data mining
See also
References
- ^ Hans-Peter Kriegel, Peer Kröger, Arthur Zimek (2009). "Outlier Detection Techniques (Tutorial)" (PDF). 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2009). Bangkok, Thailand. Retrieved 2010-03-26.
- PMID 26909165.
- ISSN 0302-9743.
- ISBN 978-1-62410-426-8.
- PMID 27178785.
- S2CID 1297145.
- S2CID 40772241.
- ^ "DBIDs". ELKI homepage. Retrieved 13 December 2016.
- ^ doi:10.1007/978-3-642-22922-0_41.
- ^ excerpt from "Data Mining Algorithms in ELKI". Retrieved 17 October 2019.
- doi:10.1007/978-3-540-69497-7_41.
- doi:10.1007/978-3-642-02982-0_35.
- doi:10.1007/978-3-642-12098-5_34.
- doi:10.1109/ICDE.2012.128.
- doi:10.1145/2463676.2463696.
- .
- arXiv:1902.03616 [cs.LG].
- .
External links
- Official website of ELKI with download and documentation.