International Conference on Emerging Trends in Engineering & Technology

Abstract

To address the low accuracy of action recognition and localization in the spatiotemporal action detection task, this paper proposes YOWO-Res, an improved CNN architecture built on a residual network and the SIoU loss. Like YOWO, YOWO-Res combines a 3D CNN and a 2D CNN to perform spatiotemporal action detection. However, YOWO-Res differs from YOWO in three significant ways: the 3D CNN branch is used solely for action recognition, while the 2D CNN branch is used solely for action localization; the 2D CNN is deepened with a residual network to extract more effective features; and bounding-box regression uses the SIoU loss function, which introduces directionality into the loss so that the model converges faster and more accurately. On the UCF-Sports and JHMDB-21 datasets, our model achieves F-mAP scores of 92.99% and 78.16% respectively, improvements of 4.79% and 3.76%. With 16-frame input clips, it achieves localization accuracies of 94.5% and 98.0% and action recognition accuracies of 94.7% and 86.2% respectively. On the JHMDB-21 dataset, it obtains video-mAP scores of 88.2%, 86.6%, and 65.5% at IoU thresholds of 0.2, 0.5, and 0.75.
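The "directionality" the abstract mentions comes from SIoU's angle-aware penalty on top of plain IoU. As a rough single-box illustration (not the paper's implementation), a simplified SIoU-style loss following Gevorgyan's 2022 formulation might look like the sketch below; the function name, the `theta` shape hyper-parameter, and the epsilon guards are our own choices.

```python
import math

def siou_loss(pred, gt, theta=4.0, eps=1e-9):
    """Simplified SIoU-style loss sketch for one box pair.

    pred, gt: (cx, cy, w, h) axis-aligned boxes.
    Returns 1 - IoU + (angle-modulated distance cost + shape cost) / 2.
    """
    px, py, pw, ph = pred
    gx, gy, gw, gh = gt

    # Plain IoU of the two boxes.
    ix1, iy1 = max(px - pw / 2, gx - gw / 2), max(py - ph / 2, gy - gh / 2)
    ix2, iy2 = min(px + pw / 2, gx + gw / 2), min(py + ph / 2, gy + gh / 2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    iou = inter / (pw * ph + gw * gh - inter + eps)

    # Smallest enclosing box, used to normalise the centre offsets.
    cw = max(px + pw / 2, gx + gw / 2) - min(px - pw / 2, gx - gw / 2)
    ch = max(py + ph / 2, gy + gh / 2) - min(py - ph / 2, gy - gh / 2)

    # Angle cost Lambda = sin(2*alpha), where alpha is the angle of the
    # line joining the two centres; this is the directionality term.
    sigma = math.hypot(gx - px, gy - py)
    sin_alpha = min(abs(gy - py) / (sigma + eps), 1.0)
    lam = math.sin(2.0 * math.asin(sin_alpha))

    # Distance cost, modulated by the angle cost via gamma = 2 - Lambda.
    gamma = 2.0 - lam
    rho_x = ((gx - px) / (cw + eps)) ** 2
    rho_y = ((gy - py) / (ch + eps)) ** 2
    delta = (1 - math.exp(-gamma * rho_x)) + (1 - math.exp(-gamma * rho_y))

    # Shape cost: penalises relative width/height mismatch.
    w_w = abs(pw - gw) / (max(pw, gw) + eps)
    w_h = abs(ph - gh) / (max(ph, gh) + eps)
    omega = (1 - math.exp(-w_w)) ** theta + (1 - math.exp(-w_h)) ** theta

    return 1.0 - iou + (delta + omega) / 2.0
```

For a perfectly matching pair the loss is zero, and it grows as the predicted centre drifts away from the ground truth, with the angle term steering gradients first along the axis connecting the two centres.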