2018 24th International Conference on Pattern Recognition (ICPR)
Abstract

Temporal events in video sequences often exhibit long-term dependencies that are difficult for a convolutional neural network (CNN) to handle. In particular, dense pixel-wise prediction of video frames is challenging for a CNN because learning the temporal correlation requires huge memory and a large number of parameters. To overcome these difficulties, we propose a recurrent encoder-decoder network that compresses the spatiotemporal features at the encoder and restores them to full-resolution results at the decoder. We adopt a convolutional long short-term memory (LSTM) within the encoder-decoder architecture, which successfully learns the spatiotemporal relations with a relatively small number of parameters. The proposed network is applied to one of the dense pixel-prediction problems, namely background subtraction in video sequences. The network is trained on video frames of limited duration, yet it generalizes well to different videos and time durations. Moreover, with additional video-specific learning, it achieves the best performance on a benchmark dataset (CDnet 2014).
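The convolutional LSTM the abstract refers to replaces the matrix multiplications of a standard LSTM with convolutions, so the gates preserve spatial structure while sharing weights across locations. The following is a minimal NumPy sketch of a single ConvLSTM cell; the channel counts, kernel size, and initialization are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def conv2d(x, w):
    """Naive 'same'-padded 2-D convolution.
    x: (C_in, H, W), w: (C_out, C_in, k, k) -> (C_out, H, W)."""
    c_out, c_in, k, _ = w.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    H, W = x.shape[1:]
    out = np.zeros((c_out, H, W))
    for o in range(c_out):
        for i in range(H):
            for j in range(W):
                out[o, i, j] = np.sum(xp[:, i:i + k, j:j + k] * w[o])
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class ConvLSTMCell:
    """One ConvLSTM step: gates are convolutions over [input; hidden]."""
    def __init__(self, c_in, c_hid, k=3, seed=0):
        rng = np.random.default_rng(seed)
        # One weight tensor produces all four gates (i, f, o, g) at once.
        self.w = rng.normal(0.0, 0.1, (4 * c_hid, c_in + c_hid, k, k))

    def step(self, x, h, c):
        # Concatenate input and hidden state along the channel axis,
        # then compute all gate pre-activations with one convolution.
        z = conv2d(np.concatenate([x, h], axis=0), self.w)
        i, f, o, g = np.split(z, 4, axis=0)
        c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # cell update
        h_new = sigmoid(o) * np.tanh(c_new)               # hidden output
        return h_new, c_new
```

Because the same small kernels are reused at every spatial position, the parameter count is independent of frame resolution, which is the property that lets such a cell sit inside an encoder-decoder without the memory blow-up of a fully connected LSTM.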
