DynamicSeq2SeqXGB for PM2.5 imputation in extremely sparse environmental monitoring networks


Safarov R., Shomanova Z., Nossenko Y., Kopishev E., Bexeitova Z., ATASOY E.

PLOS ONE, vol. 20, no. 12 (December), 2025 (SCI-Expanded, Scopus)

  • Publication Type: Article / Full Article
  • Volume: 20, Issue: 12 (December)
  • Publication Date: 2025
  • DOI: 10.1371/journal.pone.0338788
  • Journal Name: PLOS ONE
  • Journal Indexed In: Science Citation Index Expanded (SCI-EXPANDED), Scopus, BIOSIS, Chemical Abstracts Core, EMBASE, Index Islamicus, Linguistic Bibliography, MEDLINE, Psycinfo, zbMATH, Directory of Open Access Journals
  • Affiliated with Bursa Uludağ University: Yes

Abstract

Environmental monitoring networks face critical data gaps that compromise public health protection and regulatory compliance, with missing data rates often exceeding 40% in operational settings. This study validates DynamicSeq2SeqXGB, a novel hybrid model that integrates a sequence-to-sequence encoder–decoder for temporal pattern extraction with an XGBoost regressor for robust gap reconstruction under extreme sparsity. Data from five monitoring stations in Pavlodar, Kazakhstan, collected over a 15-month period from May 23, 2024 to July 19, 2025, were analyzed; these stations represent severely compromised infrastructure, with completeness rates of 23.3–57.5%. The methodology employs adaptive context processing and implements hierarchical decomposition for extended outages. Two data preparation strategies were evaluated: selective compression, which applies quality thresholds, and full compression, which retains all available observations. Benchmarking against classical methods using synthetic gaps of 5–72 hours demonstrated DynamicSeq2SeqXGB's superiority in 96% of cases under full compression and 100% under selective compression (average 48.8% improvement for both strategies), with corresponding mean absolute error (MAE) values of 3.7–8.5 μg/m³ across the Pavlodar stations. Notably, full and selective compression showed equal overall effectiveness (50% win rate each), with the optimal strategy depending on station-specific characteristics. External validation on the Beijing dataset (Guanyuan station, 2016) with controlled degradation confirmed cross-regional transferability, achieving an MAE of 8.50 μg/m³ and a coefficient of determination (R²) of 0.944 (68–79% improvement over baselines). The method successfully reconstructed PM2.5 time series even at 23.3% completeness, demonstrating robust performance for operational deployment in severely degraded monitoring networks.
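
The synthetic-gap benchmarking described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the hourly series is made up, linear interpolation stands in for the unspecified "classical methods", and the helper names (`mask_synthetic_gap`, `linear_interpolate`, `gap_mae`) are hypothetical. The idea is to hide a known span of observations, fill it with a baseline, and score the MAE only over the hidden span.

```python
import numpy as np

def mask_synthetic_gap(series, start, length):
    """Return a copy of `series` with `length` consecutive values set to NaN."""
    masked = series.astype(float).copy()
    masked[start:start + length] = np.nan
    return masked

def linear_interpolate(series):
    """Fill NaNs by linear interpolation between observed neighbours.

    Assumption: this is only a stand-in for a classical baseline.
    """
    idx = np.arange(series.size)
    observed = ~np.isnan(series)
    return np.interp(idx, idx[observed], series[observed])

def gap_mae(truth, filled, start, length):
    """Mean absolute error computed only over the artificially removed span."""
    span = slice(start, start + length)
    return float(np.mean(np.abs(truth[span] - filled[span])))

# Hypothetical hourly PM2.5 series (μg/m³) with a daily cycle; not real data.
hours = np.arange(24 * 14)
pm25 = 30.0 + 10.0 * np.sin(2 * np.pi * hours / 24.0)

# Gap lengths chosen inside the 5-72 hour range mentioned in the abstract.
results = {}
start = 100
for gap_len in (5, 24, 72):
    masked = mask_synthetic_gap(pm25, start, gap_len)
    filled = linear_interpolate(masked)
    results[gap_len] = gap_mae(pm25, filled, start, gap_len)
    print(f"gap={gap_len:>2} h  baseline MAE={results[gap_len]:.2f} μg/m³")
```

A candidate imputation model would be scored with the same `gap_mae` call on the same masked spans, so that any reported improvement over the baseline is measured on identical gaps.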