Real-Time Fine-Grained Assembly Task Recognition Using Object Detection and 3-D Hand Skeleton Data-Based Deep Learning Classifier for Operator Efficiency

Ay, Oznur; EMEL, ERDAL

doi:10.1109/access.2025.3554263

Ay O., EMEL E.

IEEE Access, 2025 (SCI-Expanded)

Publication Type: Article / Full Article
Publication Date: 2025
DOI Number: 10.1109/access.2025.3554263
Journal Name: IEEE Access
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Compendex, INSPEC, Directory of Open Access Journals
Keywords: Assembly action recognition, connected worker, long-short term memory, object detection, pose estimation, time series classification
Affiliated with Bursa Uludağ University: Yes

Abstract

Real-time monitoring of manual assembly work is essential for improving operator efficiency and product quality. Recent applications detect and prevent operator errors by visually recognizing actions in a monitored assembly scene and giving instant feedback. However, prior studies often disregarded hand-object interactions and did not model fine-grained hand movements. Because industrial assembly tasks are performed primarily by hand, the focus should be on the hands and their interactions with the tools and objects they manipulate. This paper proposes a real-time fine-grained assembly task recognition system that uses 3-D hand skeleton data extracted from streaming 2-D video frames. The hybrid system integrates a You-Only-Look-Once (YOLO) deep object detection method with a Long Short-Term Memory (LSTM) based classifier. First, the object detection method determines the possible starting point of each sequential task from the streaming data. A pose estimation algorithm then captures time series skeleton data from that starting point until YOLOv8 detects a different starting point. Subsequently, the proposed LSTM-based network classifies the time series of hand joint coordinates into the corresponding fine-grained assembly task. To evaluate the proposed system, an annotated operator-centric video dataset was created from sequential tasks at an industrial assembly station. The proposed hybrid system significantly improved operator efficiency for sequential assembly tasks, achieving 85.23% accuracy in real-time task recognition. The real-world industrial assembly dataset used in our study is also shared as open access for the assembly task recognition community.
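
The segmentation step described in the abstract (collect hand skeleton frames from one detected starting point until a different starting point is detected) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `segment_by_start_events`, the string labels, and the per-frame input format are all hypothetical, and the object detector, pose estimator, and LSTM classifier are deliberately left out. Each frame is assumed to arrive as a pair of an optional start label (as an object detector might report) and a flat list of hand-joint coordinates (as a pose estimator might report).

```python
from typing import Iterable, List, Optional, Sequence, Tuple

# Hypothetical types: one frame of flattened (x, y, z) hand-joint coordinates.
Keypoints = Sequence[float]
Frame = Tuple[Optional[str], Keypoints]
Segment = Tuple[str, List[Keypoints]]


def segment_by_start_events(frames: Iterable[Frame]) -> List[Segment]:
    """Split a frame stream into per-task keypoint sequences.

    A new segment opens whenever the detected start label differs from the
    label of the segment currently being recorded. Frames seen before the
    first start label are discarded. Each returned segment is what a
    time-series classifier (an LSTM in the paper) would receive as input.
    """
    segments: List[Segment] = []
    current_label: Optional[str] = None
    current_seq: List[Keypoints] = []

    for start_label, keypoints in frames:
        # A *different* start point closes the running segment.
        if start_label is not None and start_label != current_label:
            if current_label is not None and current_seq:
                segments.append((current_label, current_seq))
            current_label = start_label
            current_seq = []
        # Record keypoints only once some start point has been seen.
        if current_label is not None:
            current_seq.append(keypoints)

    # Flush the last open segment at end of stream.
    if current_label is not None and current_seq:
        segments.append((current_label, current_seq))
    return segments


if __name__ == "__main__":
    # Toy stream with made-up labels and one coordinate per frame.
    stream = [
        (None, [0.0]),
        ("start_A", [1.0]),
        (None, [2.0]),
        ("start_A", [3.0]),
        ("start_B", [4.0]),
        (None, [5.0]),
    ]
    for label, seq in segment_by_start_events(stream):
        print(label, len(seq))
```

In this sketch a repeated detection of the same start label does not split the segment, which mirrors the abstract's wording that capture continues until a different start point is detected. How the real system handles missed or spurious detections is not stated in the abstract.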