IEEE Access, 2025 (SCI-Expanded)
Real-time monitoring of manual assembly work is essential for improving operator efficiency and product quality. Recent applications detect and prevent operator errors with instant feedback by visually recognizing actions in a monitored assembly scene. However, prior studies often disregarded hand-object interactions and did not model fine-grained hand movements. Because industrial assembly tasks are performed primarily by hand, the focus should be on the hands and their interactions with the manipulated tools and objects. This paper proposes a real-time fine-grained assembly task recognition system that uses 3-dimensional hand skeleton data extracted from streaming 2-dimensional video frames. The hybrid system integrates a You Only Look Once (YOLO) deep object detector with a Long Short-Term Memory (LSTM) based classifier. First, the object detector identifies the possible starting point of each sequential task in the streaming data. Time-series skeleton data are then captured with a pose estimation algorithm from that starting point until YOLOv8 detects a different start point. Finally, the proposed LSTM-based network classifies the time series of hand-joint coordinates into the corresponding fine-grained assembly task. To evaluate the system, an annotated operator-centric video dataset was created from sequential tasks at an industrial assembly station. The proposed hybrid system substantially improved operator efficiency for sequential assembly tasks, achieving 85.23% accuracy in real-time task recognition. The real-world industrial assembly dataset used in this study has also been released as open access for the assembly task recognition community.
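The segmentation logic described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `detect_start` stands in for the YOLOv8 start-point detector and `classify_segment` for the LSTM classifier over hand-skeleton time series; both names are hypothetical placeholders.

```python
def segment_stream(frames, detect_start, classify_segment):
    """Split a frame stream into task segments at detected start points,
    then classify each completed segment.

    detect_start(frame)      -> task start id, or None if no start detected
    classify_segment(frames) -> predicted task label for the buffered segment
    (Both callables are hypothetical stand-ins for the YOLOv8 detector
    and the LSTM classifier described in the abstract.)
    """
    labels = []
    buffer = []          # skeleton/frame data for the current segment
    current_task = None  # start point most recently detected
    for frame in frames:
        start = detect_start(frame)
        if start is not None and start != current_task:
            # A different start point ends the previous segment.
            if buffer:
                labels.append(classify_segment(buffer))
            buffer = []
            current_task = start
        if current_task is not None:
            buffer.append(frame)
    if buffer:  # classify the trailing segment at end of stream
        labels.append(classify_segment(buffer))
    return labels
```

In this sketch, a segment runs from one detected start point until the detector reports a different one, matching the paper's description of capturing skeleton data "from the possible starting point until YOLOv8 detected a different start point."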