EMS2L: Enhanced Multi-Task Self-Supervised Learning for 3D Skeleton Representation Learning

Lilang Lin, Wangxuan Institute of Computer Technology, Peking University, China
Jiaying Liu, Wangxuan Institute of Computer Technology, Peking University, China, liujiaying@pku.edu.cn
 
Suggested Citation
Lilang Lin and Jiaying Liu (2023), "EMS2L: Enhanced Multi-Task Self-Supervised Learning for 3D Skeleton Representation Learning", APSIPA Transactions on Signal and Information Processing: Vol. 12: No. 4, e100. http://dx.doi.org/10.1561/116.00000022

Publication Date: 15 May 2023
© 2023 L. Lin and J. Liu
 
Keywords
Self-supervised learning, skeleton-based action recognition, multi-task learning
 

Open Access

This article is published under the terms of CC BY-NC.

In this article:
Introduction 
Related Work 
Enhanced Multiple Self-Supervised Learning 
Experiment Results 
Conclusion 
References 

Abstract

To learn from abundant unlabeled data for smart infrastructure, we propose Enhanced Multi-Task Self-Supervised Learning (EMS2L) for self-supervised action recognition based on 3D human skeletons. EMS2L integrates multiple self-supervised tasks to learn more comprehensive information, in contrast to previous methods that employ a single self-supervised task. The self-supervised tasks employed here include task-specific methods (i.e., motion prediction and a jigsaw puzzle task) and a task-agnostic method, contrastive learning. By combining these three self-supervised tasks, we learn rich feature representations. Specifically, motion prediction extracts detailed information by reconstructing the original data from temporally masked and noisy sequences. The jigsaw puzzle task enables the learned model to explore temporally discriminative features for human action recognition by predicting the correct order of shuffled sequences. In addition, to regularize the feature space, we use contrastive learning to constrain feature learning, increasing intra-class compactness and inter-class separability. To learn invariant representations, we propose an attention model for contrastive representation learning that reduces the distance between original features and attention features. To avoid degrading the learned representation through the pursuit of excessive invariance, this attention-based contrastive learning assigns different weights to the features of different transformed data. We evaluate EMS2L on downstream tasks under a variety of settings, including fully supervised, semi-supervised, unsupervised, and transfer learning, and explore different network architectures (i.e., GRU and GCN). Strong results on the NW-UCLA, NTU RGB+D, and PKU-MMD datasets illustrate the generality of our approach. Extensive experiments demonstrate that our method learns features that are more general and discriminative, and we provide further experimental analysis of the different self-supervised tasks.
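For concreteness, the sketch below illustrates how the three pretext tasks described in the abstract could be combined on top of a shared encoder. This is a minimal PyTorch-style sketch under our own assumptions, not the authors' released implementation: the toy GRU backbone, the head modules, the masking ratio, the four-segment jigsaw setup, and the plain InfoNCE loss (standing in for the paper's attention-weighted contrastive loss) are all illustrative placeholders.

import itertools
import torch
import torch.nn as nn
import torch.nn.functional as F

T, D, H = 48, 75, 128                                  # frames, coordinate dim (25 joints x 3), feature dim
PERMS = list(itertools.permutations(range(4)))[:8]     # subset of segment orders used as jigsaw classes

class Encoder(nn.Module):
    """Toy GRU backbone standing in for the paper's GRU/GCN encoders."""
    def __init__(self):
        super().__init__()
        self.gru = nn.GRU(D, H, batch_first=True)

    def forward(self, x):                              # x: (B, T, D)
        out, _ = self.gru(x)
        return out                                     # per-frame features: (B, T, H)

enc = Encoder()
recon_head = nn.Linear(H, D)                           # motion prediction: frame-wise reconstruction
jigsaw_head = nn.Linear(H, len(PERMS))                 # jigsaw: classify which permutation was applied
proj_head = nn.Linear(H, H)                            # contrastive projection head

def motion_prediction_loss(x):
    """Reconstruct the clean sequence from a temporally masked, noisy copy."""
    mask = (torch.rand(x.size(0), T, 1) > 0.2).float() # drop roughly 20% of frames
    corrupted = x * mask + 0.01 * torch.randn_like(x)  # masking plus additive noise
    return F.mse_loss(recon_head(enc(corrupted)), x)

def jigsaw_loss(x):
    """Shuffle four temporal segments and predict which order was applied."""
    B = x.size(0)
    segs = x.chunk(4, dim=1)                           # four segments of T/4 frames each
    labels = torch.randint(len(PERMS), (B,))
    shuffled = torch.stack(
        [torch.cat([segs[j][i] for j in PERMS[labels[i].item()]], dim=0) for i in range(B)])
    logits = jigsaw_head(enc(shuffled).mean(dim=1))    # temporally pooled clip feature
    return F.cross_entropy(logits, labels)

def contrastive_loss(x, x_aug, tau=0.1):
    """Plain InfoNCE between two views (stand-in for the attention-weighted variant)."""
    z1 = F.normalize(proj_head(enc(x).mean(dim=1)), dim=1)
    z2 = F.normalize(proj_head(enc(x_aug).mean(dim=1)), dim=1)
    logits = z1 @ z2.t() / tau                         # pairwise cosine similarities over the batch
    labels = torch.arange(x.size(0))                   # matching indices are the positives
    return F.cross_entropy(logits, labels)

x = torch.randn(8, T, D)                               # dummy batch of skeleton sequences
x_aug = x + 0.01 * torch.randn_like(x)                 # trivially augmented second view
total = motion_prediction_loss(x) + jigsaw_loss(x) + contrastive_loss(x, x_aug)
total.backward()

In the paper's formulation, the attention model reweights the contrastive term across differently transformed views rather than treating them uniformly as the InfoNCE loss above does; summing the three losses (possibly with per-task weights) is the usual way such multi-task pretraining objectives are combined.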

DOI: 10.1561/116.00000022

Companion

APSIPA Transactions on Signal and Information Processing Special Issue - Emerging AI Technologies for Smart Infrastructure
See the other articles that are part of this special issue.