Abstract
Current state-of-the-art news video retrieval systems rely mainly on automatic speech recognition (ASR) text to perform retrieval. This paradigm limits retrieval performance, as ASR text alone cannot accurately represent the full content of a news video. In this paper, we describe an automated retrieval framework that fuses the multimodal features and event structures present in news video to support precise news video retrieval. The contributions of this paper are: (a) we uncover and employ temporal event clusters to provide additional information during story-level retrieval; and (b) we integrate other modality features with text features and incorporate event clusters for pseudo-relevance feedback (PRF) in shot-level re-ranking. Experiments on the video search task of the TRECVID 2005/06 datasets show that the proposed approach is effective.
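To make the shot-level re-ranking idea concrete, the following is a minimal illustrative sketch of pseudo-relevance feedback, not the paper's actual method: the top-k shots from an initial text ranking are assumed relevant, their visual features are averaged into a feedback centroid, and each shot is re-scored by blending its text score with visual similarity to that centroid. The shot representation, `top_k`, and the blending weight `alpha` are all hypothetical choices for illustration.

```python
import math

def cosine(a, b):
    # Cosine similarity between two feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def prf_rerank(shots, top_k=2, alpha=0.7):
    """Sketch of PRF-based shot re-ranking (illustrative, not the
    paper's exact formulation).

    Treat the top_k shots by text score as pseudo-relevant, average
    their visual features into a feedback centroid, then re-rank all
    shots by a weighted sum of text score and visual similarity to
    that centroid.
    """
    ranked = sorted(shots, key=lambda s: s["text_score"], reverse=True)
    feedback = ranked[:top_k]
    dim = len(feedback[0]["visual"])
    centroid = [sum(s["visual"][i] for s in feedback) / top_k
                for i in range(dim)]
    return sorted(
        shots,
        key=lambda s: (alpha * s["text_score"]
                       + (1 - alpha) * cosine(s["visual"], centroid)),
        reverse=True,
    )

# Toy example: shot C has a weak text score but is visually close to
# the pseudo-relevant shots A and B, so PRF promotes it above D.
shots = [
    {"id": "A", "text_score": 0.9, "visual": [1.0, 0.0]},
    {"id": "B", "text_score": 0.8, "visual": [0.9, 0.1]},
    {"id": "C", "text_score": 0.5, "visual": [0.95, 0.05]},
    {"id": "D", "text_score": 0.6, "visual": [0.0, 1.0]},
]
order = [s["id"] for s in prf_rerank(shots)]
print(order)  # → ['A', 'B', 'C', 'D']
```

In this toy run, shot D outranks C on text alone, but the visual feedback from the top two shots reverses them, which is the kind of correction shot-level PRF is meant to provide.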