Classification of breast cancer using ensemble machine learning with Apache Spark


Krotha D. P., Shaik F., Lakshmi G. J.

Sigma Journal of Engineering and Natural Sciences, vol.43, no.4, pp.1385-1399, 2025 (ESCI, Scopus)

  • Publication Type: Article / Full Article
  • Volume: 43 Issue: 4
  • Publication Date: 2025
  • DOI: 10.14744/sigma.2025.00126
  • Journal Name: Sigma Journal of Engineering and Natural Sciences
  • Journal Indexes: Emerging Sources Citation Index (ESCI), Scopus, Academic Search Premier, Directory of Open Access Journals
  • Pages: pp.1385-1399
  • Keywords: Breast Cancer, Classification, Ensemble Methods, Feature Selection, Machine Learning, Spark
  • Affiliated with Bursa Uludağ University: Yes

Abstract

Breast cancer is one of the most common and serious health problems affecting people around the world. Detecting it early and correctly identifying whether a tumor is benign or malignant are therefore critical. In this study, we developed a new model called the Logistic Ensemble Fusion Model to improve the accuracy of breast cancer diagnosis. This model combines the strengths of three machine learning models, specifically Support Vector Machine, Decision Tree, and Logistic Regression, into a single ensemble approach that significantly improves on traditional methods. We used Apache Spark with its Python API to handle large datasets quickly and efficiently. To select the most important features for prediction, we used Recursive Feature Elimination (RFE) with both a Support Vector Machine (SVM-RFE) and a Random Forest (RF-RFE). We tested our model by splitting the data into training and testing sets in an 80:20 ratio. The Logistic Ensemble Fusion Model achieved an accuracy of 99.13%, precision of 98.71%, recall of 99.91%, and an F1 score of 99.12%. The entire process, which involved running 12 Spark jobs, was completed in 38 seconds. The model was compared with other models, including Random Forest, Gradient Boosting, Factorization Machine, One-vs-Rest, and Multilayer Perceptron. The main innovation of this study is the use of multiple machine learning models in a unified ensemble fusion approach, providing high classification performance and demonstrating significant advancement over previous methods. This study underscores the potential of advanced ensemble machine learning techniques and big data technologies in refining breast cancer diagnosis and supporting more effective clinical decision-making.
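
The ensemble idea and the reported metrics can be illustrated with a small sketch. The abstract does not say how the three base models are fused, so the example below assumes simple majority voting, which may differ from the authors' method. It uses plain Python rather than Spark so that it runs anywhere. The function names `majority_vote` and `binary_metrics` and all label values are hypothetical and invented for illustration; they are not data from the study.

```python
from collections import Counter


def majority_vote(*model_predictions):
    """Combine per-model label lists into one label per sample by majority vote.

    Assumption: plain voting stands in for the fusion step, which the
    abstract does not describe.
    """
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*model_predictions)]


def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels (1 = malignant, 0 = benign); invented values, not study data.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
svm_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # hypothetical SVM output
dt_pred = [1, 0, 0, 1, 0, 0, 1, 1]   # hypothetical Decision Tree output
lr_pred = [1, 1, 1, 1, 0, 0, 1, 0]   # hypothetical Logistic Regression output

fused = majority_vote(svm_pred, dt_pred, lr_pred)
scores = binary_metrics(y_true, fused)
svm_scores = binary_metrics(y_true, svm_pred)
print(scores)
print(svm_scores)
```

In this toy data each base model makes two mistakes, but never on the same sample as another model, so the vote corrects all of them. That is the general motivation for combining classifiers whose errors are not strongly correlated.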