2022 IEEE International Conference on Multimedia and Expo (ICME)

Abstract

The challenge of video super-resolution (VSR) is how to make full use of the spatial-temporal coherence among neighbouring low-resolution (LR) frames to generate a high-resolution (HR) prediction. In this study, we propose to use a transformer for VSR to capture long-range temporal dependencies. Specifically, we first spatially divide the LR images into patches and split each patch into sub-patches. Transformer encoders are applied to both the patches and the sub-patches, so that the self-attention modules can extract both global and local correlations. To accelerate training and filter out irrelevant features, we select only the top-k most similar features in the attention scheme. We then feed the extracted long-range correlations into a temporal, spatial and channel attention fusion module, which enhances the useful information along each of the three dimensions. Extensive experiments on benchmark datasets show that the proposed model outperforms state-of-the-art VSR methods in terms of PSNR/SSIM values and visual quality.
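The top-k attention selection mentioned in the abstract can be illustrated with a short sketch. The PyTorch snippet below is a minimal, hypothetical rendering of the general idea of keeping only the k most similar keys per query before the softmax; the function name `topk_self_attention`, the default `top_k=8`, and the single-head layout are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def topk_self_attention(q, k, v, top_k=8):
    """Self-attention that keeps only the top-k most similar keys per query.

    q, k, v: (batch, num_tokens, dim) tensors; requires top_k <= num_tokens.
    Hypothetical sketch of top-k attention, not the paper's exact code.
    """
    d = q.size(-1)
    # Scaled dot-product similarity between every query and every key.
    scores = q @ k.transpose(-2, -1) / d ** 0.5        # (B, N, N)
    # Find the k-th largest score in each row; it serves as the cutoff.
    topk_vals, _ = scores.topk(top_k, dim=-1)          # sorted descending
    threshold = topk_vals[..., -1:]                    # smallest kept score
    # Mask everything below the cutoff so it receives zero attention weight.
    scores = scores.masked_fill(scores < threshold, float("-inf"))
    attn = F.softmax(scores, dim=-1)
    return attn @ v                                    # weighted sum of values
```

Because the discarded scores are set to negative infinity before the softmax, they contribute exactly zero weight, which is one plausible way to realize the abstract's goal of filtering out irrelevant features while reducing the cost of training.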