2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020)

Abstract

In the domain of audio-visual person recognition, many approaches use naive fusion techniques, such as score-level fusion or concatenation, to fuse the features obtained by face and audio extraction networks. More sophisticated methods fuse the two feature sets while taking the quality of their corresponding inputs into account. In this paper, we propose a novel architecture that improves the prediction of feature quality. In contrast to previous works, which estimate feature quality from the features themselves, we combine information obtained from different layers of the feature extraction networks. In our analysis, we show that our approach outperforms state-of-the-art fusion approaches on well-established benchmarks for multimodal person verification. Moreover, we show that our model is robust against degradation of the visual input.
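To make the quality-aware fusion idea concrete, the following is a minimal PyTorch sketch of one way such a scheme could look: per-modality quality scores are predicted from descriptors of intermediate network layers (rather than from the final embeddings alone) and then used as softmax weights over the face and audio embeddings. All module names, dimensions, and the shape of the intermediate descriptors are illustrative assumptions, not the authors' actual architecture.

```python
import torch
import torch.nn as nn

class QualityWeightedFusion(nn.Module):
    """Illustrative quality-aware fusion of face and audio embeddings.

    Quality logits are predicted from pooled intermediate-layer
    descriptors of each extraction network (`face_inter`,
    `audio_inter`), echoing the paper's idea of using information
    from different layers. Dimensions are hypothetical.
    """

    def __init__(self, embed_dim=512, inter_dim=256):
        super().__init__()
        # Small heads mapping intermediate-layer descriptors to a
        # scalar quality logit per modality.
        self.face_quality = nn.Sequential(
            nn.Linear(inter_dim, 64), nn.ReLU(), nn.Linear(64, 1))
        self.audio_quality = nn.Sequential(
            nn.Linear(inter_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, face_emb, audio_emb, face_inter, audio_inter):
        # One quality logit per modality, normalized across modalities
        # so the fusion weights sum to 1 for each sample.
        logits = torch.cat(
            [self.face_quality(face_inter),
             self.audio_quality(audio_inter)], dim=1)   # (B, 2)
        weights = torch.softmax(logits, dim=1)           # (B, 2)
        fused = (weights[:, 0:1] * face_emb
                 + weights[:, 1:2] * audio_emb)          # (B, embed_dim)
        return nn.functional.normalize(fused, dim=1), weights

# Example usage with random stand-in features:
fusion = QualityWeightedFusion()
face_emb, audio_emb = torch.randn(8, 512), torch.randn(8, 512)
face_inter, audio_inter = torch.randn(8, 256), torch.randn(8, 256)
embedding, weights = fusion(face_emb, audio_emb, face_inter, audio_inter)
```

Under this kind of scheme, degrading one input (e.g., blurring the face image) should push its quality weight toward zero, so the fused embedding leans on the other modality, which is consistent with the robustness claim in the abstract.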