DOI: 10.3390/robotics15070127 ISSN: 2218-6581

A Late-Fusion Multimodal Approach for Safety-Aware Workspace Modeling in Collaborative Robotic Systems

Kevin David Ortega-Quiñones, Elias Escobar-Pereira, Michael Felipe Cifuentes-Molano, Germán Andrés Holguín-Londoño, Mauricio Holguín-Londoño

Ensuring safe coexistence between human operators and industrial robot manipulators is a critical challenge in collaborative manufacturing environments. Existing approaches rely either on dedicated safety-rated hardware, which is expensive and difficult to retrofit, or on purely vision-based classifiers that discard the precise kinematic state available from the robot controller, leaving visual ambiguities unresolved when different joint configurations look similar from fixed camera viewpoints. Kinematics-only approaches, while precise, lack the spatial context needed to disambiguate configurations near workspace boundaries. We propose RGBJointsNet, a late-fusion multimodal deep learning classifier that combines RGB visual features extracted by a frozen EfficientNet-B2 convolutional backbone with a compact kinematic stream processing the 12-dimensional joint angle vector of a dual-UR5 robotic cell. The model maps each observation to one of five mutually exclusive workspace zones: rest (C0), nominal (C1), extended (C2), shared/collision-risk (C3), and joint-limit/singularity (C4). A dedicated simulation environment built on ROS 2 Humble Hawksbill and Gazebo Classic 11 was used to generate a labelled dataset of 54,309 frames and 162,927 RGB images from three calibrated overhead cameras, with analytic ground-truth labels derived from closed-form forward kinematics. Training on a CPU with a feature-caching strategy reduces the per-epoch wall-clock time to seconds, making the approach tractable without GPU hardware. On the held-out test set, the model achieves 87.1% overall accuracy and a macro-averaged F1 score of 90.0%, with near-perfect recall of 99.3% for the safety-critical shared zone C3. The trained classifier is integrated as an ROS 2 inference node capable of running at 10 Hz on a standard workstation.
Our results demonstrate that joint angle information is a decisive complement to RGB imagery for fine-grained, safety-oriented workspace classification in simulation-derived settings.
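
The late-fusion idea described above can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation: the weights are random and untrained, the image feature size of 1408 is an assumption about the pooled EfficientNet-B2 output, the kinematic hidden width of 32 is hypothetical, and fusion by simple concatenation followed by one linear head is assumed, since the abstract does not specify these details. Only the 12-dimensional joint vector and the five zone labels come from the abstract.

```python
import math
import random

# Zone labels as listed in the abstract.
ZONES = ["C0 rest", "C1 nominal", "C2 extended",
         "C3 shared/collision-risk", "C4 joint-limit/singularity"]

IMG_DIM = 1408    # assumed size of the pooled EfficientNet-B2 feature vector
JOINT_DIM = 12    # joint angles of the dual-UR5 cell (from the abstract)
KIN_HIDDEN = 32   # hypothetical width of the kinematic stream


def make_linear(rng, n_in, n_out):
    """Create a randomly initialised (untrained) linear layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    bias = [0.0] * n_out
    return weights, bias


def linear(x, layer):
    """Apply y = W x + b."""
    weights, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(img_feat, joints, kin_layer, head_layer):
    """Late fusion: encode each modality separately, then concatenate."""
    kin_feat = relu(linear(joints, kin_layer))   # kinematic stream
    fused = list(img_feat) + kin_feat            # concatenation (assumed)
    return softmax(linear(fused, head_layer))    # zone probabilities


rng = random.Random(0)
kin_layer = make_linear(rng, JOINT_DIM, KIN_HIDDEN)
head_layer = make_linear(rng, IMG_DIM + KIN_HIDDEN, len(ZONES))

# Stand-ins for a cached backbone feature and a joint-angle reading.
img_feat = [rng.gauss(0.0, 1.0) for _ in range(IMG_DIM)]
joints = [rng.uniform(-math.pi, math.pi) for _ in range(JOINT_DIM)]

probs = classify(img_feat, joints, kin_layer, head_layer)
predicted_zone = ZONES[max(range(len(ZONES)), key=probs.__getitem__)]
print(predicted_zone, [round(p, 3) for p in probs])
```

Because the backbone is frozen, `img_feat` could be computed once per image and stored, which is presumably what makes a feature-caching strategy effective for CPU training: only the small kinematic stream and the classification head would need gradient updates.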
