Unsupervised Learning from Multi-Dimensional Data: A Fast Clustering Algorithm Utilizing Canopies and Statistical Information


ÖZCAN G.

INTERNATIONAL JOURNAL OF INFORMATION TECHNOLOGY & DECISION MAKING, vol.17, no.3, pp.841-856, 2018 (Peer-Reviewed Journal)

  • Publication Type: Article
  • Volume: 17 Issue: 3
  • Publication Date: 2018
  • Doi Number: 10.1142/s0219622018500141
  • Journal Name: INTERNATIONAL JOURNAL OF INFORMATION TECHNOLOGY & DECISION MAKING
  • Journal Indexes: Science Citation Index Expanded, Scopus
  • Page Numbers: pp.841-856

Abstract

In this study, we consider the problem of unsupervised learning from multi-dimensional datasets. In particular, we consider k-means clustering, which requires long execution times on multi-dimensional datasets. To speed up clustering while preserving accuracy, we introduce a new algorithm that we term Canopy+. The algorithm utilizes canopies and statistical techniques. Its efficient initialization and normalization methodologies also contribute to the improvement. Furthermore, we consider early termination of the clustering computation, provided that an intermediate result of the computation is accurate enough. We compared our algorithm with four popular clustering algorithms. The results indicate that our algorithm speeds up the clustering computation by at least 2X. We also analyzed the contribution of early termination. The results show that a further 2X improvement can be obtained while incurring a 0.1% error rate. We also observe that our Canopy+ algorithm benefits from early termination, which introduces an extra 1.2X performance improvement.
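
To make the two ideas named in the abstract concrete, the following is a minimal sketch of a classic canopy pass used to seed k-means, followed by Lloyd iterations that stop early once the relative improvement in the sum of squared errors falls below a tolerance. It is a generic illustration and not the Canopy+ algorithm itself, whose statistical and normalization steps are not described in the abstract. The function names, the thresholds `t1` and `t2`, and the tolerance `tol` are all assumptions introduced for illustration.

```python
import math
import random


def canopy_pass(points, t1, t2, seed=0):
    """Classic canopy pass with a loose threshold t1 and a tight threshold t2.

    Returns a list of (center, members) pairs. Illustrative only; this is
    not the Canopy+ procedure from the paper.
    """
    assert t1 > t2 > 0
    remaining = list(points)
    random.Random(seed).shuffle(remaining)
    canopies = []
    while remaining:
        center = remaining.pop()
        # Every remaining point within the loose threshold joins this canopy.
        members = [center] + [p for p in remaining if math.dist(p, center) <= t1]
        canopies.append((center, members))
        # Points within the tight threshold can no longer start a new canopy.
        remaining = [p for p in remaining if math.dist(p, center) > t2]
    return canopies


def kmeans_early_stop(points, centers, tol=1e-3, max_iter=100):
    """Lloyd's k-means that stops when the relative SSE gain is at most tol.

    `tol` is a hypothetical accuracy knob: a larger value stops sooner and
    accepts a less refined intermediate result.
    """
    centers = [tuple(c) for c in centers]
    prev_sse = None
    sse = 0.0
    iteration = 0
    for iteration in range(1, max_iter + 1):
        clusters = [[] for _ in centers]
        sse = 0.0
        # Assignment step: attach each point to its nearest center.
        for p in points:
            dists = [math.dist(p, c) for c in centers]
            j = min(range(len(centers)), key=dists.__getitem__)
            clusters[j].append(p)
            sse += dists[j] ** 2
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            tuple(sum(axis) / len(cl) for axis in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centers)
        ]
        # Early termination: the intermediate result is "accurate enough".
        if prev_sse is not None and prev_sse - sse <= tol * prev_sse:
            break
        prev_sse = sse
    return centers, sse, iteration
```

Seeding from canopy centers fixes the number of clusters to the number of canopies, which is one common way to combine the two techniques. Raising `tol` trades a small loss in clustering quality for fewer iterations, which is the kind of trade-off the early-termination results in the abstract quantify.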