This is published under the terms of CC BY-NC.
Downloaded: 185 times
To learn from the numerous unlabeled data for smart infrastructure, we propose Enhanced Multi-Task Self-Supervised Learning (EMS2L) for self-supervised action recognition based on 3D human skeleton. With EMS2L, multiple self-supervised tasks are integrated to learn more comprehensive information, which is different from previous methods in which a single self-supervised task is manipulated. The self-supervised tasks employed here include task-specific methods (i.e., motion prediction and jigsaw puzzle task) and task-agnostic methods such as contrastive learning. Through the combination of these three self-supervised tasks, we can learn rich feature representations. Specifically, motion prediction is applied to extract detailed information by reconstructing original data from temporally masked and noisy sequences. Jigsaw puzzle makes the learned model capable of exploring temporal discriminative features for human action recognition by predicting the correct orders of shuffled sequences. Besides, to standardize the feature space, we utilize contrastive learning to constrain feature learning to increase the compactness within the class and separability between classes. To learn invariant representations, an attention model is proposed for contrastive representation learning to reduce the distance between original features and attention features. To avoid the performance degradation of network representation due to the pursuit of excessive invariance, this attention-based contrastive learning gives different degrees of weights to the features of different transformed data. Under a variety of settings, including fully-supervised, semi-supervised, unsupervised, and transfer learning, we evaluate EMS2L with downstream tasks. We also explore different network architectures (i.e., GRU GCN). The remarkable results on NW-UCLA, NTU RGB+D, and PKUMMD datasets illustrate the generality of our approach. With sufficient and extensive experiments, the advantage of our method is demonstrated by learning features that are more general and discriminative. Besides, we further provide more experimental analysis for different self-supervised tasks.
APSIPA Transactions on Signal and Information Processing Special Issue - Emerging AI Technologies for Smart Infrastructure
See the other articles that are part of this special issue.