2021 IEEE International Conference on Multimedia and Expo (ICME)

Abstract

Unsupervised representation learning for skeleton-based human action can be utilized in a variety of pose analysis applications. However, previous unsupervised methods focus on modeling the temporal dependencies in sequences while devoting less effort to modeling the spatial structure of human actions. To this end, we propose a novel unsupervised learning framework named Hierarchical Transformer for skeleton-based human action recognition. The Hierarchical Transformer consists of hierarchically aggregated self-attention modules that better capture the spatial and temporal structure of skeleton sequences. Furthermore, we propose predicting the motion between adjacent frames as a novel pre-training task for better capturing long-term dependencies in sequences. Experimental results show that our method outperforms prior state-of-the-art unsupervised methods on the NTU RGB+D and NW-UCLA datasets. Moreover, our method achieves state-of-the-art performance when the pre-trained model is transferred to the SBU dataset, which demonstrates the generalizability of the learned representations.
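To make the motion-prediction pretext task concrete, below is a minimal PyTorch sketch of one plausible formulation: the self-supervised target is the per-joint displacement between adjacent frames, and the model is trained to regress it. The `MotionPredictor` module, all tensor shapes, and the simple MLP encoder are hypothetical stand-ins for illustration only, not the paper's actual Hierarchical Transformer.

```python
import torch
import torch.nn as nn

def motion_targets(seq: torch.Tensor) -> torch.Tensor:
    """Motion between adjacent frames. `seq` has shape (batch, T, J, C):
    T frames, J joints, C coordinates. Returns per-joint displacements
    of shape (batch, T-1, J, C)."""
    return seq[:, 1:] - seq[:, :-1]

class MotionPredictor(nn.Module):
    """Hypothetical encoder standing in for the Hierarchical Transformer;
    any model mapping (batch, T, J, C) to (batch, T-1, J, C) predictions fits."""
    def __init__(self, joints: int = 25, coords: int = 3, hidden: int = 128):
        super().__init__()
        d = joints * coords
        self.net = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, d))

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        b, t, j, c = seq.shape
        feats = self.net(seq.reshape(b, t, j * c))    # per-frame features
        # Frame t predicts the motion from frame t to frame t+1.
        return feats[:, :-1].reshape(b, t - 1, j, c)

# Example pre-training step: 4 clips, 50 frames, 25 joints, xyz coordinates.
seq = torch.randn(4, 50, 25, 3)
model = MotionPredictor()
loss = nn.functional.mse_loss(model(seq), motion_targets(seq))
loss.backward()
```

Regressing frame-to-frame displacements rather than raw poses forces the representation to encode dynamics, which is one reasonable way such a pretext task could help capture long-term dependencies.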