2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020)

Abstract

In the domain of audio-visual person recognition, many approaches use naive fusion techniques, such as score-level fusion or concatenation, to fuse the features obtained by face and audio extraction networks. More sophisticated methods fuse the two feature sets while taking the quality of their corresponding inputs into account. In this paper, we propose a novel architecture that improves the prediction of feature quality. In contrast to previous works, which estimate feature quality from the features themselves, we combine information obtained from different layers of the feature extraction networks. In our analysis, we show that our approach outperforms state-of-the-art fusion approaches on well-established benchmarks for multimodal person verification. Moreover, we show that our model is robust against degradation of the visual input.
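To make the quality-aware fusion idea concrete, the following is a minimal PyTorch sketch of one way such a scheme could look: per-modality quality scores are predicted from descriptors of intermediate network layers (rather than from the final embeddings alone) and then used as softmax weights over the face and audio embeddings. All module names, dimensions, and the shape of the intermediate descriptors are illustrative assumptions, not the authors' actual architecture.

```python
import torch
import torch.nn as nn

class QualityWeightedFusion(nn.Module):
    """Illustrative quality-aware fusion of face and audio embeddings.

    Quality logits are predicted from pooled intermediate-layer
    descriptors of each extraction network (`face_inter`,
    `audio_inter`), echoing the paper's idea of using information
    from different layers. Dimensions are hypothetical.
    """

    def __init__(self, embed_dim=512, inter_dim=256):
        super().__init__()
        # Small heads mapping intermediate-layer descriptors to a
        # scalar quality logit per modality.
        self.face_quality = nn.Sequential(
            nn.Linear(inter_dim, 64), nn.ReLU(), nn.Linear(64, 1))
        self.audio_quality = nn.Sequential(
            nn.Linear(inter_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, face_emb, audio_emb, face_inter, audio_inter):
        # One quality logit per modality, normalized across modalities
        # so the fusion weights sum to 1 for each sample.
        logits = torch.cat(
            [self.face_quality(face_inter),
             self.audio_quality(audio_inter)], dim=1)   # (B, 2)
        weights = torch.softmax(logits, dim=1)           # (B, 2)
        fused = (weights[:, 0:1] * face_emb
                 + weights[:, 1:2] * audio_emb)          # (B, embed_dim)
        return nn.functional.normalize(fused, dim=1), weights

# Example usage with random stand-in features:
fusion = QualityWeightedFusion()
face_emb, audio_emb = torch.randn(8, 512), torch.randn(8, 512)
face_inter, audio_inter = torch.randn(8, 256), torch.randn(8, 256)
embedding, weights = fusion(face_emb, audio_emb, face_inter, audio_inter)
```

Under this kind of scheme, degrading one input (e.g., blurring the face image) should push its quality weight toward zero, so the fused embedding leans on the other modality, which is consistent with the robustness claim in the abstract.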