2024 IEEE International Conference on Multimedia and Expo (ICME)

Abstract

Accurately detecting emotions in conversation is a necessary yet challenging task due to the complexity of emotions and the dynamics of dialogue. The emotional state of a speaker can be influenced by many different factors, such as interlocutor stimulus, dialogue scene, and topic. In this work, we propose a conversational speech emotion recognition method that captures attentive contextual dependencies and speaker-sensitive interactions. First, we use a pretrained WavLM model to extract frame-level audio representations from individual utterances. Second, an attentive bi-directional gated recurrent unit (GRU) models context-sensitive information and jointly explores listener dependency and speaker influence in a simple, fast, parameter-efficient way. Experiments conducted on the standard conversational dataset MELD demonstrate the effectiveness of the proposed method compared against state-of-the-art methods.
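To make the two-stage pipeline concrete, the sketch below (not the authors' code) illustrates one plausible reading of the abstract: a frozen pretrained WavLM produces frame-level features that are pooled into utterance embeddings, and a bi-directional GRU with self-attention models the dialogue context. The model name "microsoft/wavlm-base", mean pooling, multi-head self-attention as the "attentive" component, and freezing WavLM are all assumptions; the paper's speaker-sensitive interaction mechanism is not reproduced here.

```python
import torch
import torch.nn as nn
from transformers import WavLMModel


class DialogueEmotionSketch(nn.Module):
    """Hypothetical sketch: WavLM frame features per utterance,
    then an attentive bi-directional GRU over the dialogue turns."""

    def __init__(self, hidden_dim=256, num_emotions=7):  # MELD uses 7 emotion labels
        super().__init__()
        # Pretrained WavLM, frozen here so only the GRU/attention/classifier
        # train (freezing is our assumption, in line with the abstract's
        # emphasis on a simple, parameter-efficient model).
        self.wavlm = WavLMModel.from_pretrained("microsoft/wavlm-base")
        for p in self.wavlm.parameters():
            p.requires_grad = False
        feat_dim = self.wavlm.config.hidden_size  # 768 for wavlm-base
        # Bi-directional GRU over the sequence of utterance embeddings.
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
        # Self-attention across turns: one plausible form of the "attentive"
        # contextual modelling (the paper's exact mechanism may differ).
        self.attn = nn.MultiheadAttention(2 * hidden_dim, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_emotions)

    def embed_utterance(self, waveform):
        # waveform: (1, num_samples) mono audio sampled at 16 kHz.
        with torch.no_grad():
            frames = self.wavlm(waveform).last_hidden_state  # (1, T, feat_dim)
        return frames.mean(dim=1)                            # mean-pool -> (1, feat_dim)

    def forward(self, utterances):
        # utterances: list of (1, num_samples) tensors, one per dialogue turn.
        embs = torch.stack([self.embed_utterance(w).squeeze(0) for w in utterances])
        states, _ = self.gru(embs.unsqueeze(0))              # (1, N, 2*hidden_dim)
        context, _ = self.attn(states, states, states)       # attend across turns
        return self.classifier(context.squeeze(0))           # (N, num_emotions) logits


# Usage on dummy data: five one-second turns, one emotion prediction per turn.
model = DialogueEmotionSketch()
dialogue = [torch.randn(1, 16000) for _ in range(5)]
logits = model(dialogue)  # shape (5, 7)
```

Because WavLM stays frozen, only the GRU, attention, and classifier layers are trained, which is one way the "fast, parameter-efficient" claim could be realized in practice.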