Unsupervised Learning from Multi-Dimensional Data: A Fast Clustering Algorithm Utilizing Canopies and Statistical Information


ÖZCAN G.

INTERNATIONAL JOURNAL OF INFORMATION TECHNOLOGY & DECISION MAKING, vol.17, no.3, pp.841-856, 2018 (SCI-Expanded)

  • Publication Type: Article / Full Article
  • Volume: 17 Issue: 3
  • Publication Date: 2018
  • DOI: 10.1142/s0219622018500141
  • Journal Name: INTERNATIONAL JOURNAL OF INFORMATION TECHNOLOGY & DECISION MAKING
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus
  • Page Numbers: pp.841-856
  • Keywords: Data mining, multi-dimensional datasets, k-means clustering, canopies, normalization, early termination
  • Bursa Uludağ University Affiliated: Yes

Abstract

In this study, we consider the problem of unsupervised learning from multi-dimensional datasets. In particular, we consider k-means clustering, which requires long execution times on multi-dimensional datasets. To speed up clustering while preserving accuracy, we introduce a new algorithm that we term Canopy+. The algorithm utilizes canopies and statistical techniques. Its efficient initialization and normalization methodologies also contribute to the improvement. Furthermore, we consider early termination of the clustering computation, provided that an intermediate result of the computation is accurate enough. We compared our algorithm with four popular clustering algorithms. The results show that our algorithm speeds up the clustering computation by at least 2X. We also analyzed the contribution of early termination. The results show that a further 2X improvement can be obtained while incurring a 0.1% error rate. We also observe that our Canopy+ algorithm benefits from early termination, which introduces an extra 1.2X performance improvement.
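
The ingredients named in the abstract (normalization, canopies as a starting point for k-means, and early termination) can be illustrated with a minimal sketch. This is not the Canopy+ algorithm itself, whose details are not given in this record. It combines plain min-max normalization, the classic canopy pass with two distance thresholds, and Lloyd's k-means with a relative-improvement stopping rule. All function names, the thresholds `t1` and `t2`, and the tolerance `tol` are assumptions chosen for illustration.

```python
import math
import random


def min_max_normalize(points):
    """Scale every dimension to [0, 1]; constant dimensions map to 0."""
    dims = list(zip(*points))
    lows = [min(d) for d in dims]
    spans = [(max(d) - lo) or 1.0 for d, lo in zip(dims, lows)]
    return [[(x - lo) / s for x, lo, s in zip(p, lows, spans)] for p in points]


def canopy_centers(points, t1, t2):
    """Classic canopy pass (assumes t1 > t2 > 0): one mean point per canopy."""
    remaining = list(range(len(points)))
    centers = []
    while remaining:
        seed = points[remaining[0]]
        # Loose threshold t1 decides which points belong to this canopy.
        members = [points[i] for i in remaining if math.dist(points[i], seed) < t1]
        centers.append([sum(c) / len(members) for c in zip(*members)])
        # Tight threshold t2 decides which points can no longer seed a canopy.
        remaining = [i for i in remaining if math.dist(points[i], seed) >= t2]
    return centers


def kmeans(points, initial_centers, max_iter=100, tol=1e-3):
    """Lloyd's iterations, stopped early once the relative SSE gain is <= tol."""
    centers = [list(c) for c in initial_centers]
    prev_sse = None
    for iteration in range(1, max_iter + 1):
        # Assignment step: nearest center for every point.
        labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        sse = sum(math.dist(p, centers[l]) ** 2 for p, l in zip(points, labels))
        # Update step: move each non-empty cluster's center to its mean.
        for j in range(len(centers)):
            group = [p for p, l in zip(points, labels) if l == j]
            if group:
                centers[j] = [sum(c) / len(group) for c in zip(*group)]
        # Early termination: the intermediate result is treated as good enough.
        if prev_sse is not None and prev_sse - sse <= tol * prev_sse:
            break
        prev_sse = sse
    return labels, centers, iteration


# Hypothetical demo data: two well-separated 2-D blobs.
rng = random.Random(0)
blob_a = [[rng.gauss(0, 0.3), rng.gauss(0, 0.3)] for _ in range(50)]
blob_b = [[rng.gauss(10, 0.3), rng.gauss(10, 0.3)] for _ in range(50)]
data = min_max_normalize(blob_a + blob_b)

seeds = canopy_centers(data, t1=0.6, t2=0.4)
labels, centers, iterations = kmeans(data, seeds)
```

In this sketch the canopy pass fixes both the number of clusters and their starting centers, so k-means begins close to a good solution and the stopping rule can trigger after few iterations. A looser `tol` trades a small loss of accuracy for fewer iterations, which is the kind of trade-off the abstract reports for early termination.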