
SCIENTIA SINICA Informationis, Volume 51, Issue 3: 399 (2021). https://doi.org/10.1360/SSI-2020-0235

Attentive pooling for group activity recognition

  • Received: Aug 2, 2020
  • Accepted: Nov 6, 2020
  • Published: Feb 25, 2021

Abstract


Funded by

Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (Grant No. 2018AAA0102100)


References

[1] Simonyan K, Zisserman A. Two-stream convolutional networks for action recognition in videos. In: Proceedings of Advances in Neural Information Processing Systems (NIPS 2014), 2014. 568--576.

[2] Wang L M, Xiong Y J, Wang Z, et al. Temporal segment networks: towards good practices for deep action recognition. In: Proceedings of European Conference on Computer Vision (ECCV), 2016. 20--36.

[3] Ji S, Xu W, Yang M, et al. 3D convolutional neural networks for human action recognition. IEEE Trans Pattern Anal Mach Intell, 2013, 35: 221-231.

[4] Tran D, Bourdev L, Fergus R, et al. Learning spatiotemporal features with 3D convolutional networks. In: Proceedings of IEEE International Conference on Computer Vision (ICCV), 2015. 4489--4497.

[5] Yu S, Cheng Y, Xie L. Fully convolutional networks for action recognition. IET Comput Vision, 2017, 11: 744-749.

[6] Varol G, Laptev I, Schmid C. Long-term temporal convolutions for action recognition. IEEE Trans Pattern Anal Mach Intell, 2018, 40: 1510-1517.

[7] Choi W, Savarese S. A unified framework for multi-target tracking and collective activity recognition. In: Proceedings of European Conference on Computer Vision (ECCV), 2012. 215--230.

[8] Ibrahim M S, Muralidharan S, Deng Z W, et al. A hierarchical deep temporal model for group activity recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 1971--1980.

[9] Shu T M, Todorovic S, Zhu S C. CERN: confidence-energy recurrent network for group activity recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 5523--5531.

[10] Li X, Chuah M C. SBGAR: semantics based group activity recognition. In: Proceedings of IEEE International Conference on Computer Vision (ICCV), 2017. 2876--2885.

[11] Bagautdinov T M, Alahi A, Fleuret F, et al. Social scene understanding: end-to-end multi-person action localization and collective activity recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 3425--3434.

[12] Wang M S, Ni B B, Yang X K. Recurrent modeling of interaction context for collective activity recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 3048--3056.

[13] Biswas S, Gall J. Structural recurrent neural network (SRNN) for group activity analysis. In: Proceedings of IEEE Winter Conference on Applications of Computer Vision (WACV), 2018. 1625--1632.

[14] Ibrahim M S, Muralidharan S, Deng Z W, et al. Hierarchical deep temporal models for group activity recognition. 2016. arXiv.

[15] Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation, 1997, 9: 1735-1780.

[16] Herath S, Harandi M, Porikli F. Going deeper into action recognition: a survey. Image Vision Computing, 2017, 60: 4-21.

[17] Chaquet J M, Carmona E J, Fernández-Caballero A. A survey of video datasets for human action and activity recognition. Comput Vision Image Understanding, 2013, 117: 633-659.

[18] Nabi M, Bue A, Murino V. Temporal poselets for collective activity detection and recognition. In: Proceedings of IEEE International Conference on Computer Vision Workshops (ICCV Workshops), 2013. 500--507.

[19] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. In: Proceedings of Advances in Neural Information Processing Systems (NIPS 2017), 2017. 5998--6008.

[20] Mnih V, Heess N, Graves A, et al. Recurrent models of visual attention. In: Proceedings of Advances in Neural Information Processing Systems (NIPS 2014), 2014. 2204--2212.

[21] Sharma S, Kiros R, Salakhutdinov R. Action recognition using visual attention. 2015. arXiv.

[22] Yang Z C, Yang D Y, Dyer C, et al. Hierarchical attention networks for document classification. In: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2016. 1480--1489.

[23] Long X, Gan C, de Melo G, et al. Attention clusters: purely attention based local feature integration for video classification. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 7834--7843.

[24] Lin Z S, Feng M W, Yu M, et al. A structured self-attentive sentence embedding. 2017. arXiv.

[25] King D E. Dlib-ml: a machine learning toolkit. J Mach Learn Res, 2009, 10: 1755-1758.

[26] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks. In: Proceedings of Advances in Neural Information Processing Systems (NIPS 2012), 2012. 1097--1105.

[27] Glorot X, Bengio Y. Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS 2010), 2010. 249--256.

[28] Kingma D, Ba J. Adam: a method for stochastic optimization. 2014. arXiv.

  • Figure 1

    (Color online) Example of left winpoint

  • Figure 2

    (Color online) Example of right spike

  • Figure 3

    (Color online) Framework of attentive pooling based model for group activity recognition

  • Figure 4

    (Color online) Global attentive pooling at time-step $t$

  • Figure 5

    (Color online) Hierarchical attentive pooling at time-step $t$

  • Figure 6

    (Color online) Attentive pooling with subgroups concatenation

  • Figure 7

    (Color online) Visualization of experiment results

  • Figure 8

    Confusion matrix of experiment results

  • Table 1   Model comparison
    Model ID | Model name | Description
    B1 | HTM-max pooling | Baseline, proposed by [8]
    B2 | HTM-avg pooling | Replace pooling mechanism in B1 with average pooling
    B3 | HTM-GAP | Combine the HTM model with GAP
    B4 | HTM-HAP | Combine the HTM model with HAP
    B5 | HTM-APSC | Combine the HTM model with APSC
  • Table 2   Results of ablation study on the Volleyball dataset
    Type | Method | Accuracy (%)
    Integration | B1-HTM-max pooling | 70.3
    Integration | B2-HTM-avg pooling | 68.5
    Integration | B3-HTM-GAP (Ours) | 74.2
    Integration | B4-HTM-HAP (Ours) | 77.7
    Concatenation | B1-HTM-max pooling | 81.9
    Concatenation | B2-HTM-avg pooling | 80.7
    Concatenation | B5-HTM-APSC (Ours) | 84.5
  • Table 3   Comparison of experiment results on the Volleyball dataset
    Type | Method | Accuracy (%)
    Integration | B1-Hierarchical LSTM [14] |
    Integration | CERN (Integration) [9] | 73.5
    Integration | SBGAR [10] | 66.9
    Integration | SRNN (Integration) [13] | 74.4
    Integration | Ours-B3 | 77.7
    Concatenation | B1-Hierarchical LSTM [14] | 81.9
    Concatenation | CERN (Concatenation) [9] | 83.3
    Concatenation | SRNN (Concatenation) [13] | 83.4
    Concatenation | Ours-B5 | 84.5
  • Table 4   Impact of MLP hidden units in attention module
    Hidden layer dim | Method | Accuracy (%)
    1024 | HTM-GAP | 69.0
    1024 | HTM-HAP | 77.7
    1024 | HTM-APSC | 83.1
    512 | HTM-GAP | 74.2
    512 | HTM-HAP | 77.4
    512 | HTM-APSC | 83.4
    256 | HTM-GAP | 73.3
    256 | HTM-HAP | 77.4
    256 | HTM-APSC | 82.4
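
Figures 4-6 and Table 4 refer to an attention MLP that scores each person's features before pooling them into a group representation. Since the body text is not reproduced here, the following is only a minimal NumPy sketch, under common assumptions, of what a global attentive pooling step (Figure 4) could look like: per-person features at time-step $t$ are scored by a one-hidden-layer MLP (whose width corresponds to the "hidden layer dim" varied in Table 4), normalized with a softmax, and combined by a weighted sum instead of max or average pooling. All names (global_attentive_pooling, W1, w2, ...) are illustrative, not the authors' implementation.

```python
import numpy as np

def global_attentive_pooling(person_feats, W1, b1, w2, b2):
    """Hypothetical sketch of global attentive pooling at one time-step.

    person_feats: (K, D) per-person features at time-step t
                  (e.g. hidden states of a person-level LSTM).
    W1, b1, w2, b2: parameters of a one-hidden-layer MLP that assigns
                  one attention score per person.
    Returns the attention weights (K,) and the pooled group feature (D,).
    """
    # MLP attention scores: one scalar per person
    hidden = np.tanh(person_feats @ W1 + b1)          # (K, H)
    scores = hidden @ w2 + b2                         # (K,)
    # Softmax over the K persons (numerically stable)
    scores = scores - scores.max()
    alpha = np.exp(scores) / np.exp(scores).sum()     # (K,)
    # Attention-weighted sum replaces max/average pooling over persons
    group_feat = alpha @ person_feats                 # (D,)
    return alpha, group_feat

# Toy usage: 12 players, 1024-d features, 512 hidden units in the attention MLP
rng = np.random.default_rng(0)
K, D, H = 12, 1024, 512
X = rng.standard_normal((K, D))
W1 = rng.standard_normal((D, H)) * 0.01
b1 = np.zeros(H)
w2 = rng.standard_normal(H) * 0.01
b2 = 0.0
alpha, g = global_attentive_pooling(X, W1, b1, w2, b2)
print(alpha.shape, g.shape)  # (12,) (1024,)
```

Under the same assumptions, the hierarchical variant (HAP, Figure 5) would apply such pooling within subgroups and then again across the resulting subgroup features, while the subgroup-concatenation variant (APSC, Figure 6) would concatenate the pooled subgroup features before the group-level classifier.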