
SCIENTIA SINICA Informationis, Volume 51, Issue 5: 695 (2021). https://doi.org/10.1360/SSI-2020-0165

Video distillation

More info
  • Received: Jun 6, 2020
  • Accepted: Feb 22, 2021
  • Published: Apr 13, 2021

Funded by

National Key R&D Program of China (2018AAA0102201)

National Natural Science Foundation of China (61871470, 61761130079, U1801262)

China Postdoctoral Science Foundation (2020TQ0236)


References

[1] Li Q, Gkoumas D, Lioma C. Quantum-inspired multimodal fusion for video sentiment analysis. Inf Fusion, 2021, 65: 58-71 CrossRef Google Scholar

[2] Nie X, Lin P, Yang M. Hierarchical feature fusion hashing for near-duplicate video retrieval. Sci Sin-Inf, 2018, 48: 1697-1708 CrossRef Google Scholar

[3] Xiang S J, Yang J Q, Huang J W. Perceptual video hashing robust against geometric distortions. Sci China Inf Sci, 2012, 55: 1520-1527 CrossRef Google Scholar

[4] Ghatak S, Rup S, Didwania H. GAN based efficient foreground extraction and HGWOSA based optimization for video synopsis generation. Digital Signal Processing, 2021, 111: 102988 CrossRef Google Scholar

[5] Lokoc J, Soucek T, Vesely P, et al. A W2VV+ case study with automated and interactive text-to-video retrieval. In: Proceedings of the ACM International Conference on Multimedia, 2020. 2553--2561. Google Scholar

[6] Hu D, Li X, Nie F. Deep linear discriminant analysis hashing. Sci Sin-Inf, 2021, 51: 279 CrossRef Google Scholar

[7] Chieu H L, Ng H T. A maximum entropy approach to information extraction from semi-structured and free text. In: Proceedings of the 18th National Conference on Artificial Intelligence and Fourteenth Conference on Innovative Applications of Artificial Intelligence, Edmonton, 2002. 786--791. Google Scholar

[8] Camastra F, Vinciarelli A. Machine Learning for Audio, Image and Video Analysis: Theory and Applications. London: Springer, 2015. 295--336. Google Scholar

[9] Li X, Zhang H, Zhang R. Discriminative and Uncorrelated Feature Selection With Constrained Spectral Analysis in Unsupervised Learning. IEEE Trans Image Process, 2020, 29: 2139-2149 CrossRef PubMed ADS Google Scholar

[10] Gong H G, Li X L. A survey on big data systems. Sci Sin-Inf, 2015, 45: 1-44 CrossRef Google Scholar

[11] Li Z, Wang W, Li Z, et al. Towards visually explaining video understanding networks with perturbation. In: Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2021. 65--76. Google Scholar

[12] Aafaq N, Mian A, Liu W. Video Description. ACM Comput Surv, 2020, 52: 1-37 CrossRef Google Scholar

[13] Li X, Song J, Gao L, et al. Beyond RNNs: positional self-attention with co-attention for video question answering. In: Proceedings of the AAAI Conference on Artificial Intelligence, 2019. 8658--8665. Google Scholar

[14] Li X, Zhao B, Lu X. Key Frame Extraction in the Summary Space. IEEE Trans Cybern, 2018, 48: 1923-1934 CrossRef PubMed Google Scholar

[15] Schneider W J, Newman D A. Intelligence is multidimensional: Theoretical review and implications of specific cognitive abilities. Human Resource Manage Rev, 2015, 25: 12-27 CrossRef Google Scholar

[16] Hussain T, Muhammad K, Ding W. A comprehensive survey of multi-view video summarization. Pattern Recognition, 2021, 109: 107567 CrossRef Google Scholar

[17] Baskurt K B, Samet R. Video synopsis: A survey. Comput Vision Image Understanding, 2019, 181: 26-38 CrossRef Google Scholar

[18] Aafaq N, Akhtar N, Liu W, et al. Spatio-temporal dynamics and semantic attribute enriched visual encoding for video captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019. 12487--12496. Google Scholar

[19] Li X, Wang Z, Lu X. Video Synopsis in Complex Situations. IEEE Trans Image Process, 2018, 27: 3798-3812 CrossRef PubMed ADS Google Scholar

[20] Dalal N, Triggs B. Histograms of oriented gradients for human detection. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005. 886--893. Google Scholar

[21] Zhang K, Chao W, Sha F, et al. Video summarization with long short-term memory. In: Proceedings of the European Conference on Computer Vision, 2016. 766--782. Google Scholar

[22] Lowe D G. Distinctive Image Features from Scale-Invariant Keypoints. Int J Comput Vision, 2004, 60: 91-110 CrossRef Google Scholar

[23] Zhao B, Li X, Lu X. Hierarchical recurrent neural network for video summarization. In: Proceedings of the ACM Conference on Multimedia, 2017. 863--871. Google Scholar

[24] Venugopalan S, Rohrbach M, Donahue J, et al. Sequence to sequence - video to text. In: Proceedings of the IEEE International Conference on Computer Vision, 2015. 4534--4542. Google Scholar

[25] Pan P, Xu Z, Yang Y, et al. Hierarchical recurrent neural encoder for video representation with application to captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016. 1029--1038. Google Scholar

[26] Park H S, Jun C H. A simple and fast algorithm for K-medoids clustering. Expert Syst Appl, 2009, 36: 3336-3341 CrossRef Google Scholar

[27] de Avila S E F, Lopes A P B, da Luz Jr. A. VSUMM: A mechanism designed to produce static video summaries and a novel evaluation method. Pattern Recognition Lett, 2011, 32: 56-68 CrossRef Google Scholar

[28] Papadopoulos D P, Kalogeiton V S, Chatzichristofis S A. Automatic summarization and annotation of videos with lack of metadata information. Expert Syst Appl, 2013, 40: 5765-5778 CrossRef Google Scholar

[29] Chu W, Song Y, Jaimes A. Video co-summarization: video summarization by visual co-occurrence. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015. 3584--3592. Google Scholar

[30] Elhamifar E, Sapiro G, Vidal R. See all by looking at a few: sparse modeling for finding representative objects. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012. 1600--1607. Google Scholar

[31] Mei S, Guan G, Wang Z, et al. L2,0 constrained sparse dictionary selection for video summarization. In: Proceedings of the IEEE International Conference on Multimedia and Expo, 2014. 1--6. Google Scholar

[32] Ma M, Mei S, Wan S. Video summarization via block sparse dictionary selection. Neurocomputing, 2020, 378: 197-209 CrossRef Google Scholar

[33] Ma M, Mei S, Wan S. Video summarization via block sparse dictionary selection. Neurocomputing, 2020, 378: 197-209 CrossRef Google Scholar

[34] Zhao B, Xing E P. Quasi real-time summarization for consumer videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014. 2513--2520. Google Scholar

[35] Gong B, Chao W, Grauman K, et al. Diverse sequential subset selection for supervised video summarization. In: Proceedings of the 27th International Conference on Neural Information Processing Systems, 2014. 2069--2077. Google Scholar

[36] Meng J, Wang H, Yuan J, et al. From keyframes to key objects: video summarization by representative object proposal selection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016. 1039--1048. Google Scholar

[37] Gygli M, Grabner H, van Gool L. Video summarization by learning submodular mixtures of objectives. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015. 3090--3098. Google Scholar

[38] Pan J, Yang H, Faloutsos C. MMSS: multi-modal story-oriented video summarization. In: Proceedings of the IEEE International Conference on Data Mining, 2004. 491--494. Google Scholar

[39] Li X, Zhao B, Lu X. A General Framework for Edited Video and Raw Video Summarization. IEEE Trans Image Process, 2017, 26: 3652-3664 CrossRef PubMed ADS Google Scholar

[40] Zhao B, Li X, Lu X. Property-Constrained Dual Learning for Video Summarization. IEEE Trans Neural Netw Learning Syst, 2020, 31: 3989-4000 CrossRef PubMed Google Scholar

[41] Lee Y J, Ghosh J, Grauman K. Discovering important people and objects for egocentric video summarization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012. 1346--1353. Google Scholar

[42] Pritch Y, Rav-Acha A, Peleg S. Nonchronological video synopsis and indexing. IEEE Trans Pattern Anal Mach Intell, 2008, 30: 1971-1984 CrossRef PubMed Google Scholar

[43] Jin J, Liu F, Gan Z, et al. Online video synopsis method through simple tube projection strategy. In: Proceedings of the 8th International Conference on Wireless Communications & Signal Processing, 2016. 1--5. Google Scholar

[44] Hoshen Y, Peleg S. Live video synopsis for multiple cameras. In: Proceedings of 2015 IEEE International Conference on Image Processing, 2015. 212--216. Google Scholar

[45] Taj M, Maggio E, Cavallaro A. Multi-feature graph-based object tracking. In: Proceedings of Multimodal Technologies for Perception of Humans, First International Evaluation Workshop on Classification of Events, Activities and Relationships, 2006. 190--199. Google Scholar

[46] He Y, Qu Z, Gao C. Fast Online Video Synopsis Based on Potential Collision Graph. IEEE Signal Process Lett, 2017, 24: 22-26 CrossRef ADS Google Scholar

[47] Sun L, Xing J, Ai H, et al. A tracking based fast online complete video synopsis approach. In: Proceedings of the 21st International Conference on Pattern Recognition, 2012. 1956--1959. Google Scholar

[48] Hsia C H, Chiang J S, Hsieh C F. Low-complexity range tree for video synopsis system. Multimed Tools Appl, 2016, 75: 9885-9902 CrossRef Google Scholar

[49] Zhu X, Liu J, Wang J. Key observation selection-based effective video synopsis for camera network. Machine Vision Appl, 2014, 25: 145-157 CrossRef Google Scholar

[50] Zhu X, Loy C C, Gong S. Learning from Multiple Sources for Video Summarisation. Int J Comput Vis, 2016, 117: 247-268 CrossRef Google Scholar

[51] Lin L, Lin W, Xiao W. An optimized video synopsis algorithm and its distributed processing model. Soft Comput, 2017, 21: 935-947 CrossRef Google Scholar

[52] Yan C, Tu Y, Wang X. STAT: Spatial-Temporal Attention Mechanism for Video Captioning. IEEE Trans Multimedia, 2020, 22: 229-241 CrossRef Google Scholar

[53] Felzenszwalb P F, Girshick R B, McAllester D. Object detection with discriminatively trained part-based models. IEEE Trans Pattern Anal Mach Intell, 2010, 32: 1627-1645 CrossRef PubMed Google Scholar

[54] Moore D J, Essa I A. Recognizing multitasked activities from video using stochastic context-free grammar. In: Proceedings of the 18th National Conference on Artificial Intelligence and Fourteenth Conference on Innovative Applications of Artificial Intelligence, 2002. 770--776. Google Scholar

[55] Zhu S C, Mumford D. A Stochastic Grammar of Images. FNT Comput Graphics Vision, 2006, 2: 259-362 CrossRef Google Scholar

[56] Donahue J, Hendricks L A, Rohrbach M. Long-Term Recurrent Convolutional Networks for Visual Recognition and Description. IEEE Trans Pattern Anal Mach Intell, 2017, 39: 677-691 CrossRef PubMed Google Scholar

[57] Zhao B, Li X, Lu X. Video captioning with tube features. In: Proceedings of International Joint Conference on Artificial Intelligence, 2018. 1177--1183. Google Scholar

[58] Sah S, Nguyen T, Ptucha R. Understanding temporal structure for video captioning. Pattern Anal Applic, 2020, 23: 147-159 CrossRef Google Scholar

[59] Yao L, Torabi A, Cho K, et al. Describing videos by exploiting temporal structure. In: Proceedings of the IEEE International Conference on Computer Vision, 2015. 4507--4515. Google Scholar

[60] Li X, Zhao B, Lu X. MAM-RNN: multi-level attention model based RNN for video captioning. In: Proceedings of the International Joint Conference on Artificial Intelligence, 2017. 2208--2214. Google Scholar

[61] Li L, Gong B. End-to-end video captioning with multitask reinforcement learning. In: Proceedings of IEEE Winter Conference on Applications of Computer Vision, 2019. 339--348. Google Scholar

[62] Dai B, Fidler S, Urtasun R, et al. Towards diverse and natural image descriptions via a conditional GAN. In: Proceedings of the IEEE International Conference on Computer Vision, 2017. 2989--2998. Google Scholar

[63] Rahman T, Xu B, Sigal L. Watch, listen and tell: multi-modal weakly supervised dense event captioning. In: Proceedings of 2019 IEEE/CVF International Conference on Computer Vision, 2019. 8907--8916. Google Scholar

[64] Lu C, Shi J, Wang W. Fast Abnormal Event Detection. Int J Comput Vis, 2019, 127: 993-1011 CrossRef Google Scholar

[65] Basavarajaiah M, Sharma P. Survey of Compressed Domain Video Summarization Techniques. ACM Comput Surv, 2020, 52: 1-29 CrossRef Google Scholar

[66] Zhang Y, Liang X, Zhang D. Unsupervised object-level video summarization with online motion auto-encoder. Pattern Recognition Lett, 2020, 130: 376-385 CrossRef Google Scholar

[67] Wang W, Shen J, Lu X. Paying Attention to Video Object Pattern Understanding. IEEE Trans Pattern Anal Mach Intell, 2020: 1-1 CrossRef PubMed Google Scholar

[68] Wan S, Goudos S. Faster R-CNN for multi-class fruit detection using a robotic vision system. Comput Networks, 2020, 168: 107036 CrossRef Google Scholar

[69] Li X L, Dong Y S, Shi J H. A survey of scene image classification. Sci Sin-Inf, 2015, 45: 827-848 CrossRef Google Scholar

[70] Fooladgar F, Kasaei S. A survey on indoor RGB-D semantic segmentation: from hand-crafted features to deep convolutional neural networks. Multimed Tools Appl, 2020, 79: 4499-4524 CrossRef Google Scholar

[71] Han J, Ma K K. Fuzzy color histogram and its use in color image retrieval. IEEE Trans Image Process, 2002, 11: 944-952 CrossRef PubMed ADS Google Scholar

[72] Bay H, Tuytelaars T, Van Gool L. Surf: speeded up robust features. In: Proceedings of European Conference on Computer Vision. Berlin: Springer, 2006. 404--417. Google Scholar

[73] Krizhevsky A, Sutskever I, Hinton G E. Imagenet classification with deep convolutional neural networks. In: Proceedings of the 25th International Conference on Neural Information Processing Systems, 2012. 1106--1114. Google Scholar

[74] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. In: Proceedings of the International Conference on Learning Representations, 2015. Google Scholar

[75] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015. 3431--3440. Google Scholar

[76] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016. 770--778. Google Scholar

[77] Szegedy C, Liu W, Jia Y, et al. Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015. 1--9. Google Scholar

[78] Huang G, Liu Z, van der Maaten L, et al. Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017. 4700--4708. Google Scholar

[79] Deng J, Dong W, Socher R, et al. Imagenet: a large-scale hierarchical image database. In: Proceedings of 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009. 248--255. Google Scholar

[80] Kuznetsova A, Rom H, Alldrin N, et al. The open images dataset v4: unified image classification, object detection, and visual relationship detection at scale. 2018. arXiv Google Scholar

[81] Lecun Y, Bottou L, Bengio Y. Gradient-based learning applied to document recognition. Proc IEEE, 1998, 86: 2278-2324 CrossRef Google Scholar

[82] Krizhevsky A, Hinton G, et al. Learning multiple layers of features from tiny images. 2009. 32--35. Google Scholar

[83] Zhou B, Lapedriza A, Xiao J, et al. Learning deep features for scene recognition using places database. In: Advances in Neural Information Processing Systems, 2014. 487--495. Google Scholar

[84] Soomro K, Zamir A R, Shah M. UCF101: a dataset of 101 human actions classes from videos in the wild. 2012. arXiv Google Scholar

[85] Kuehne H, Jhuang H, Garrote E, et al. HMDB: a large video database for human motion recognition. In: Proceedings of 2011 International Conference on Computer Vision, 2011. 2556--2563. Google Scholar

[86] Monfort M, Andonian A, Zhou B, et al. Moments in time dataset: one million videos for event understanding. 2018. arXiv Google Scholar

[87] Heilbron C, Escorcia V, Ghanem B, et al. Activitynet: a large-scale video benchmark for human activity understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015. 961--970. Google Scholar

[88] Kay W, Carreira J, Simonyan K, et al. The kinetics human action video dataset. 2017. arXiv Google Scholar

[89] Tan X, Triggs B. Enhanced Local Texture Feature Sets for Face Recognition Under Difficult Lighting Conditions. IEEE Trans Image Process, 2010, 19: 1635-1650 CrossRef PubMed ADS Google Scholar

[90] Smith J R, Chang S. Automated binary texture feature sets for image retrieval. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996. 2239--2242. Google Scholar

[91] Song Y, Vallmitjana J, Stent A, et al. Tvsum: summarizing web videos using titles. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015. 5179--5187. Google Scholar

[92] Hafner J, Sawhney H S, Equitz W. Efficient color histogram indexing for quadratic form distance functions. IEEE Trans Pattern Anal Machine Intell, 1995, 17: 729-736 CrossRef Google Scholar

[93] Weixian L, Xiaoping L, Mingli D. Golf video tracking based on recognition with HOG and spatial-temporal vector. Int J Adv Robotic Syst, 2017, 14: 172988141770454 CrossRef Google Scholar

[94] Sanal Kumar K P, Bhavani R. Human activity recognition in egocentric video using HOG, GiST and color features. Multimed Tools Appl, 2020, 79: 3543-3559 CrossRef Google Scholar

[95] Lee T, Hwangbo M, Alan T, et al. Low-complexity hog for efficient video saliency. In: Proceedings of 2015 IEEE International Conference on Image Processing (ICIP), 2015. 3749--3752. Google Scholar

[96] Zhu Y, Huang X, Huang Q. Large-scale video copy retrieval with temporal-concentration SIFT. Neurocomputing, 2016, 187: 83-91 CrossRef Google Scholar

[97] Battiato S, Gallo G, Puglisi G, et al. Sift features tracking for video stabilization. In: Proceedings of the 14th International Conference on Image Analysis and Processing (ICIAP 2007), 2007. 825--830. Google Scholar

[98] Li X, Chen M, Wang Q. Multiview-based group behavior analysis in optical image sequence. Sci Sin-Inf, 2018, 48: 1227-1241 CrossRef Google Scholar

[99] Mhaskar H N, Poggio T. Deep vs. shallow networks: An approximation theory perspective. Anal Appl, 2016, 14: 829-848 CrossRef Google Scholar

[100] Ren Z, Xu W. An improved path integration method for nonlinear systems under Poisson white noise excitation. Appl Math Computation, 2020, 373: 125036 CrossRef Google Scholar

[101] Zhou K, Qiao Y, Xiang T. Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward. In: Proceedings of the AAAI Conference on Artificial Intelligence, 2018. 7582--7589. Google Scholar

[102] Yuan Y, Li H, Wang Q. Spatiotemporal Modeling for Video Summarization Using Convolutional Recurrent Neural Network. IEEE Access, 2019, 7: 64676-64685 CrossRef Google Scholar

[103] Yao T, Mei T, Rui Y. Highlight detection with pairwise deep ranking for first-person video summarization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016. 982--990. Google Scholar

[104] Michelucci U. Advanced Applied Deep Learning: Convolutional Neural Networks and Object Detection. New York: Apress, 2019. Google Scholar

[105] Gibson J, Marques O. Optical Flow and Trajectory Estimation Methods. Berlin: Springer, 2019. 9--21. Google Scholar

[106] Dosovitskiy A, Fischer P, Ilg E, et al. Flownet: learning optical flow with convolutional networks. In: Proceedings of IEEE International Conference on Computer Vision, 2015. 2758--2766. Google Scholar

[107] Ilg E, Mayer N, Saikia T, et al. Flownet 2.0: evolution of optical flow estimation with deep networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017. 1647--1655. Google Scholar

[108] Ranjan A, Black M J. Optical flow estimation using a spatial pyramid network. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017. 2720--2729. Google Scholar

[109] Sun D, Yang X, Liu M, et al. PWC-net: CNNs for optical flow using pyramid, warping, and cost volume. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018. 8934--8943. Google Scholar

[110] Revaud J, Weinzaepfel P, Harchaoui Z, et al. Epicflow: edge-preserving interpolation of correspondences for optical flow. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2015. 1164--1172. Google Scholar

[111] Ji S, Xu W, Yang M, et al. 3D convolutional neural networks for human action recognition. In: Proceedings of the 27th International Conference on Machine Learning, 2010. 495--502. Google Scholar

[112] Tran D, Bourdev L D, Fergus R, et al. C3D: generic features for video analysis. 2014. arXiv Google Scholar

[113] Carreira J, Zisserman A. Quo vadis, action recognition? A new model and the kinetics dataset. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017. 4724--4733. Google Scholar

[114] Hara K, Kataoka H, Satoh Y. Learning spatio-temporal features with 3D residual networks for action recognition. In: Proceedings of IEEE International Conference on Computer Vision Workshops, 2017. 3154--3160. Google Scholar

[115] Qiu Z, Yao T, Mei T. Learning spatio-temporal representation with pseudo-3D residual networks. In: Proceedings of IEEE International Conference on Computer Vision, 2017. 5534--5542. Google Scholar

[116] Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation, 1997, 9: 1735-1780 CrossRef PubMed Google Scholar

[117] Cho K, van Merrienboer B, Bahdanau D, et al. On the properties of neural machine translation: encoder-decoder approaches. In: Proceedings of Workshop on Syntax, Semantics and Structure in Statistical Translation, 2014. 103--111. Google Scholar

[118] Zhao B, Li X, Lu X. HSA-RNN: hierarchical structure-adaptive RNN for video summarization. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018. 7405--7414. Google Scholar

[119] Arabshahi F, Lu Z, Singh S, et al. Memory augmented recursive neural networks. 2019. arXiv Google Scholar

[120] Ren Z, Xu W, Zhang S. Reliability analysis of nonlinear vibro-impact systems with both randomly fluctuating restoring and damping terms. Commun NOnlinear Sci Numer Simul, 2020, 82: 105087 CrossRef ADS Google Scholar

[121] Mahasseni B, Lam M, Todorovic S. Unsupervised video summarization with adversarial LSTM networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017. 2982--2991. Google Scholar

[122] Bengio Y, Simard P, Frasconi P. Learning long-term dependencies with gradient descent is difficult. IEEE Trans Neural Netw, 1994, 5: 157-166 CrossRef PubMed Google Scholar

[123] Hu Z, Nie F, Wang R. Multi-view spectral clustering via integrating nonnegative embedding and spectral embedding. Inf Fusion, 2020, 55: 251-259 CrossRef Google Scholar

[124] Nie F, Hu Z, Li X. Matrix Completion Based on Non-Convex Low-Rank Approximation. IEEE Trans Image Process, 2019, 28: 2378-2388 CrossRef PubMed ADS Google Scholar

[125] Chinrungrueng C, Sequin C H. Optimal adaptive k-means algorithm with dynamic adjustment of learning rate. IEEE Trans Neural Netw, 1995, 6: 157-169 CrossRef PubMed Google Scholar

[126] Frey B J, Dueck D. Clustering by Passing Messages Between Data Points. Science, 2007, 315: 972-976 CrossRef PubMed ADS Google Scholar

[127] Rodriguez A, Laio A. Clustering by fast search and find of density peaks. Science, 2014, 344: 1492-1496 CrossRef PubMed ADS Google Scholar

[128] Ren J, Jiang J, Feng Y. Activity-driven content adaptation for effective video summarization. J Visual Communication Image Representation, 2010, 21: 930-938 CrossRef Google Scholar

[129] Khosla A, Hamid R, Lin C, et al. Large-scale video summarization using web-image priors. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013. 2698--2705. Google Scholar

[130] Cong Y, Yuan J, Luo J. Towards Scalable Summarization of Consumer Videos Via Sparse Dictionary Selection. IEEE Trans Multimedia, 2012, 14: 66-75 CrossRef Google Scholar

[131] Gong Y, Liu X. Video summarization using singular value decomposition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2000. 174--180. Google Scholar

[132] Ma M, Mei S, Wan S. Video summarization via block sparse dictionary selection. Neurocomputing, 2020, 378: 197-209 CrossRef Google Scholar

[133] Wang S, Cong Y, Cao J. Scalable gastroscopic video summarization via similar-inhibition dictionary selection. Artificial Intelligence Med, 2016, 66: 1-13 CrossRef PubMed Google Scholar

[134] Etezadifar P, Farsi H. Scalable video summarization via sparse dictionary learning and selection simultaneously. Multimed Tools Appl, 2017, 76: 7947-7971 CrossRef Google Scholar

[135] Marvaniya S, Damoder M, Gopalakrishnan V, et al. Real-time video summarization on mobile. In: Proceedings of IEEE International Conference on Image Processing, 2016. 176--180. Google Scholar

[136] Wang J, Wang Y, Zhang Z. Visual saliency based aerial video summarization by online scene classification. In: Proceedings of International Conference on Image and Graphics, 2011. 777--782. Google Scholar

[137] Yao T, Mei T, Rui Y. Highlight detection with pairwise deep ranking for first-person video summarization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016. 982--990. Google Scholar

[138] Atencio P, German S T, Branch J W. Video summarisation by deep visual and categorical diversity. IET Comput Vision, 2019, 13: 569-577 CrossRef Google Scholar

[139] Kaushal V, Iyer R, Doctor K, et al. Demystifying multi-faceted video summarization: tradeoff between diversity, representation, coverage and importance. In: Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2019. 452--461. Google Scholar

[140] Ejaz N, Tariq T B, Baik S W. Adaptive key frame extraction for video summarization using an aggregation mechanism. J Visual Communication Image Representation, 2012, 23: 1031-1040 CrossRef Google Scholar

[141] Macchi O. The coincidence approach to stochastic point processes. Adv Appl Probab, 1975, 7: 83-122 CrossRef Google Scholar

[142] Liu D, Hua G, Chen T. A hierarchical visual model for video object summarization. IEEE Trans Pattern Anal Mach Intell, 2010, 32: 2178-2190 CrossRef PubMed Google Scholar

[143] Li X, Chen M, Nie F, et al. A multiview-based parameter free framework for group detection. In: Proceedings of the AAAI Conference on Artificial Intelligence, 2017. 4147--4153. Google Scholar

[144] Lee Y J, Grauman K. Predicting Important Objects for Egocentric Video Summarization. Int J Comput Vis, 2015, 114: 38-55 CrossRef Google Scholar

[145] Yang L, Cheng H, Su J. Pixel-to-Model Distance for Robust Background Reconstruction. IEEE Trans Circuits Syst Video Technol, 2016, 26: 903-916 CrossRef Google Scholar

[146] Clare R H, Bardelle C, Harper P. Industrial scale high-throughput screening delivers multiple fast acting macrofilaricides. Nat Commun, 2019, 10: 11 CrossRef PubMed ADS Google Scholar

[147] Zhao B, Li X, Lu X. TTH-RNN: Tensor-Train Hierarchical Recurrent Neural Network for Video Summarization. IEEE Trans Ind Electron, 2020. https://doi.org/10.1109/TIE.2020.2979573 Google Scholar

[148] Alemany S, Beltran J, Perez A, et al. Predicting hurricane trajectories using a recurrent neural network. In: Proceedings of the AAAI Conference on Artificial Intelligence, 2019. 468--475. Google Scholar

[149] Yang H, Wang B, Lin S, et al. Unsupervised extraction of video highlights via robust recurrent auto-encoders. In: Proceedings of the IEEE International Conference on Computer Vision, 2015. 4633--4641. Google Scholar

[150] Zhang K, Chao W, Sha F, et al. Summary transfer: Exemplar-based subset selection for video summarization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016. 1059--1067. Google Scholar

[151] Mahasseni B, Lam M, Todorovic S. Unsupervised video summarization with adversarial LSTM networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017. 2982--2991. Google Scholar

[152] Apostolidis E E, Adamantidou E, Metsai A I, et al. Unsupervised video summarization via attention-driven adversarial learning. In: Proceedings of the International Conference on MultiMedia Modeling, 2020. 492--504. Google Scholar

[153] Ji Z, Xiong K, Pang Y. Video Summarization With Attention-Based Encoder-Decoder Networks. IEEE Trans Circuits Syst Video Technol, 2020, 30: 1709-1717 CrossRef Google Scholar

[154] Rochan M, Ye L, Wang Y. Video summarization using fully convolutional sequence networks. In: Proceedings of the European Conference on Computer Vision, 2018. 347--363. Google Scholar

[155] Rochan M, Wang Y. Video summarization by learning from unpaired data. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019. 7902--7911. Google Scholar

[156] Potapov D, Douze M, Harchaoui Z, et al. Category-specific video summarization. In: Proceedings of the European Conference on Computer Vision, 2014. 540--555. Google Scholar

[157] Gygli M, Grabner H, Riemenschneider H, et al. Creating summaries from user videos. In: Proceedings of the European Conference on Computer Vision, 2014. 505--520. Google Scholar

[158] Zeng K, Chen T, Niebles J C, et al. Title generation for user generated videos. In: Proceedings of the European Conference on Computer Vision, 2016. 609--625. Google Scholar

[159] Meng J, Wang S, Wang H, et al. Video summarization via multi-view representative selection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017. 1189--1198. Google Scholar

[160] Anirudh R, Masroor A, Turaga P K. Diversity promoting online sampling for streaming video summarization. In: Proceedings of the IEEE International Conference on Image Processing, 2016. 3329--3333. Google Scholar

[161] Li X, Zhao B, Lu X. Key Frame Extraction in the Summary Space. IEEE Trans Cybern, 2018, 48: 1923-1934 CrossRef PubMed Google Scholar

[162] Otani M, Nakashima Y, Rahtu E, et al. Rethinking the evaluation of video summaries. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019. 7596--7604. Google Scholar

[163] Rav-Acha A, Pritch Y, Peleg S. Making a long video short: dynamic video synopsis. In: Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2006. 1--7. Google Scholar

[164] Ra M, Kim W Y. Parallelized Tube Rearrangement Algorithm for Online Video Synopsis. IEEE Signal Process Lett, 2018, 25: 1186-1190 CrossRef ADS Google Scholar

[165] Ghatak S, Rup S, Majhi B. An improved surveillance video synopsis framework: a HSATLBO optimization approach. Multimed Tools Appl, 2020, 79: 4429-4461 CrossRef Google Scholar

[166] Nie Y, Xiao C, Sun H. Compact video synopsis via global spatiotemporal optimization. IEEE Trans Visual Comput Graphics, 2013, 19: 1664-1676 CrossRef PubMed Google Scholar

[167] Wang S, Xu W, Chao W, et al. A framework for surveillance video fast browsing based on object flags. In: Proceedings of Pacific-rim Conference on Multimedia, 2013. 411--421. Google Scholar

[168] Feng S, Lei Z, Yi D, et al. Online content-aware video condensation. In: Proceedings of 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012. 2082--2087. Google Scholar

[169] Huang C R, Chung P C J, Yang D K. IEEE Trans Circuits Syst Video Technol, 2014, 24: 1417-1429 CrossRef Google Scholar

[170] Lu M, Wang Y, Pan G. Generating fluent tubes in video synopsis. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2013. 2292--2296. Google Scholar

[171] Chou C, Lin C, Chiang T, et al. Coherent event-based surveillance video synopsis using trajectory clustering. In: Proceedings of 2015 IEEE International Conference on Multimedia & Expo Workshops, 2015. 1--6. Google Scholar

[172] Zhu X, Loy C C, Gong S. Video synopsis by heterogeneous multi-source correlation. In: Proceedings of IEEE International Conference on Computer Vision, 2013. 81--88. Google Scholar

[173] Nie F, Wang Z, Wang R. IEEE Trans Pattern Anal Mach Intell, 2020: 1-1 CrossRef PubMed Google Scholar

[174] Mahapatra A, Sa P K, Majhi B. MVS: A multi-view video synopsis framework. Signal Processing-Image Communication, 2016, 42: 31-44 CrossRef Google Scholar

[175] Lin W, Zhang Y, Lu J. Summarizing surveillance videos with local-patch-learning-based abnormality detection, blob sequence optimization, and type-based synopsis. Neurocomputing, 2015, 155: 84-98 CrossRef Google Scholar

[176] Pritch Y, Rav-Acha A, Gutman A, et al. Webcam synopsis: Peeking around the world. In: Proceedings of IEEE International Conference on Computer Vision, 2007. 1--8. Google Scholar

[177] Correa C D, Ma K. Dynamic video narratives. ACM Trans Graph, 2010, 29: 88:1--88:9 CrossRef Google Scholar

[178] Xu M, Li S Z, Li B, et al. A set theoretical method for video synopsis. In: Proceedings of the 1st ACM SIGMM International Conference on Multimedia Information Retrieval, 2008. 366--370. Google Scholar

[179] Li X, Wang Z, Lu X. Surveillance Video Synopsis via Scaling Down Objects. IEEE Trans Image Process, 2016, 25: 740-755 CrossRef PubMed ADS Google Scholar

[180] Nie Y, Li Z, Zhang Z. Collision-Free Video Synopsis Incorporating Object Speed and Size Changes. IEEE Trans Image Process, 2020, 29: 1465-1478 CrossRef PubMed ADS Google Scholar

[181] He Y, Gao C, Sang N. Graph coloring based surveillance video synopsis. Neurocomputing, 2017, 225: 64-79 CrossRef Google Scholar

[182] Yildiz A, Ozgur A, Akgul Y S. Fast non-linear video synopsis. In: Proceedings of International Symposium on Computer and Information Sciences, 2008. 1--6. Google Scholar

[183] Vural U, Akgul Y S. Eye-gaze based real-time surveillance video synopsis. Pattern Recognition Lett, 2009, 30: 1151-1159 CrossRef Google Scholar

[184] Kirkpatrick S, Gelatt C D, Vecchi M P. Optimization by Simulated Annealing. Science, 1983, 220: 671-680 CrossRef PubMed ADS Google Scholar

[185] Ghatak S, Rup S, Majhi B. HSAJAYA: An Improved Optimization Scheme for Consumer Surveillance Video Synopsis Generation. IEEE Trans Consumer Electron, 2020, 66: 144-152 CrossRef Google Scholar

[186] Rao R V. Jaya: a simple and new optimization algorithm for solving constrained and unconstrained optimization problems. Int J Industrial Engineering Computations, 2016, 7: 19-34 CrossRef Google Scholar

[187] Ruan T, Wei S, Li J. Rearranging Online Tubes for Streaming Video Synopsis: A Dynamic Graph Coloring Approach. IEEE Trans Image Process, 2019, 28: 3873-3884 CrossRef PubMed ADS Google Scholar

[188] Liao W, Tu Z, Wang S, et al. Compressed-domain video synopsis via 3D graph cut and blank frame deletion. In: Proceedings of the on Thematic Workshops of ACM Multimedia, 2017. 253--261. Google Scholar

[189] Zhong R, Hu R, Wang Z. Fast Synopsis for Moving Objects Using Compressed Video. IEEE Signal Process Lett, 2014, 21: 834-838 CrossRef ADS Google Scholar

[190] Zhang Z, Nie Y, Sun H. Multi-View Video Synopsis via Simultaneous Object-Shifting and View-Switching Optimization. IEEE Trans Image Process, 2020, 29: 971-985 CrossRef PubMed ADS Google Scholar

[191] Zhao Z Q, Zheng P, Xu S T. Object Detection With Deep Learning: A Review. IEEE Trans Neural Netw Learning Syst, 2019, 30: 3212-3232 CrossRef PubMed Google Scholar

[192] Chin T, Ding R, Marculescu D. AdaScale: towards real-time video object detection using adaptive scaling. 2019. arXiv Google Scholar

[193] Alwando E H P, Chen Y T, Fang W H. CNN-Based Multiple Path Search for Action Tube Detection in Videos. IEEE Trans Circuits Syst Video Technol, 2020, 30: 104-116 CrossRef Google Scholar

[194] Yuan Y, Wang D, Wang Q. Memory-augmented temporal dynamic learning for action recognition. In: Proceedings of the 33rd AAAI Conference on Artificial Intelligence, 2019. 9167--9175. Google Scholar

[195] Li X, Chen M, Nie F, et al. Locality adaptive discriminant analysis. In: Proceedings of International Joint Conference on Artificial Intelligence, 2017. 2201--2207. Google Scholar

[196] Guadarrama S, Krishnamoorthy N, Malkarnenkar G, et al. Youtube2text: recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In: Proceedings of the IEEE International Conference on Computer Vision, 2013. 2712--2719. Google Scholar

[197] Thomason J, Venugopalan S, Guadarrama S, et al. Integrating language and vision to generate natural language descriptions of videos in the wild. In: Proceedings of the International Conference on Computational Linguistics, 2014. 1218--1227. Google Scholar

[198] Krishnamoorthy N, Malkarnenkar G, Mooney R J, et al. Generating natural-language video descriptions using text-mined knowledge. In: Proceedings of the AAAI Conference on Artificial Intelligence. 2013. Google Scholar

[199] Kojima A, Tamura T, Fukunaga K. Natural language description of human activities from video images based on concept hierarchy of actions. Int J Comput Vision, 2002, 50: 171-184 CrossRef Google Scholar

[200] Gong S, Xiang T. Recognition of group activities using dynamic probabilistic networks. In: Proceedings of the 9th IEEE International Conference on Computer Vision, 2003. 742--749. Google Scholar

[201] Bobick A F, Wilson A D. A state-based approach to the representation and recognition of gesture. IEEE Trans Pattern Anal Machine Intell, 1997, 19: 1325-1337 CrossRef Google Scholar

[202] Hanckmann P, Schutte K, Burghouts G J. Automated textual descriptions for a wide range of video events with 48 human actions. In: Computer Vision — ECCV 2012, Berlin: Springer, 2012. 372--380. Google Scholar

[203] Pan Y, Yao T, Li H, et al. Video captioning with transferred semantic attributes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017. 984--992. Google Scholar

[204] Baraldi L, Grana C, Cucchiara R. Hierarchical boundary-aware neural encoder for video captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017. 3185--3194. Google Scholar

[205] Cherian A, Wang J, Hori C, et al. Spatio-temporal ranked-attention networks for video captioning. 2020. arXiv Google Scholar

[206] Nabati M, Behrad A. Video captioning using boosted and parallel Long Short-Term Memory networks. Comput Vision Image Understanding, 2020, 190: 102840 CrossRef Google Scholar

[207] Guo Y, Zhang J, Gao L. Exploiting long-term temporal dynamics for video captioning. World Wide Web, 2019, 22: 735-749 CrossRef Google Scholar

[208] Wang H, Gao C, Han Y. Sequence in sequence for video captioning. Pattern Recognition Lett, 2020, 130: 327-334 CrossRef Google Scholar

[209] Venugopalan S, Xu H, Donahue J, et al. Translating videos to natural language using deep recurrent neural networks. In: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015. 1494--1504. Google Scholar

[210] Long X, Gan C, de Melo G. Video Captioning with Multi-Faceted Attention. TACL, 2018, 6: 173-184 CrossRef Google Scholar

[211] Yu Y, Ko H, Choi J, et al. Video captioning and retrieval models with semantic attention. 2016. arXiv Google Scholar

[212] Venugopalan S, Hendricks L A, Mooney R J, et al. Improving lSTM-based video description with linguistic knowledge mined from text. In: Proceedings of Conference on Empirical Methods in Natural Language Processing, 2016. 1961--1966. Google Scholar

[213] Zhao B, Li X, Lu X. CAM-RNN: Co-Attention Model Based RNN for Video Captioning. IEEE Trans Image Process, 2019, 28: 5552-5565 CrossRef PubMed ADS Google Scholar

[214] Hori C, Hori T, Lee T, et al. Attention-based multimodal fusion for video description. In: Proceedings of the IEEE International Conference on Computer Vision, 2017. 4203--4212. Google Scholar

[215] Aafaq N, Mian A, Liu W. Video Description: A Survey of Methods, Datasets, and Evaluation Metrics. ACM Comput Surv, 2020, 52: 1-37 CrossRef Google Scholar

[216] Baraldi L, Grana C, Cucchiara R. Hierarchical boundary-aware neural encoder for video captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017. 3185--3194. Google Scholar

[217] Ren L, Qi G J, Hua K. Improving diversity of image captioning through variational autoencoders and adversarial learning. In: Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2019. 263--272. Google Scholar

[218] Park J S, Rohrbach M, Darrell T, et al. Adversarial inference for multi-sentence video description. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019. 6598--6608. Google Scholar

[219] Ren Z, Wang X, Zhang N, et al. Deep reinforcement learning-based image captioning with embedding reward. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017. 1151--1159. Google Scholar

[220] Wang X, Chen W, Wu J, et al. Video captioning via hierarchical reinforcement learning. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018. 4213--4222. Google Scholar

[221] Zhang W, Wang B, Ma L, et al. Reconstruct and represent video contents for captioning via reinforcement learning. 2019. arXiv Google Scholar

[222] Song J, Guo Y, Gao L. From Deterministic to Generative: Multimodal Stochastic RNNs for Video Captioning. IEEE Trans Neural Netw Learning Syst, 2019, 30: 3047-3058 CrossRef PubMed Google Scholar

[223] Nabati M, Behrad A. Video captioning using boosted and parallel Long Short-Term Memory networks. Comput Vision Image Understanding, 2020, 190: 102840 CrossRef Google Scholar

[224] Wu A, Han Y. Hierarchical memory decoding for video captioning. 2020. arXiv Google Scholar

[225] Chen D L, Dolan W B. Collecting highly parallel data for paraphrase evaluation. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011. 190--200. Google Scholar

[226] Regneri M, Rohrbach M, Wetzel D. Grounding Action Descriptions in Videos. TACL, 2013, 1: 25-36 CrossRef Google Scholar

[227] Torabi A, Pal C J, Larochelle H, et al. Using descriptive video services to create a large data source for video annotation research. 2015. arXiv Google Scholar

[228] Rohrbach A, Rohrbach M, Tandon N, et al. A dataset for movie description. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015. 3202--3212. Google Scholar

[229] Xu J, Mei T, Yao T, et al. MSR-VTT: a large video description dataset for bridging video and language. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016. 5288--5296. Google Scholar

[230] Sigurdsson G A, Varol G, Wang X, et al. Hollywood in homes: crowdsourcing data collection for activity understanding. In: Proceedings of the European Conference on Computer Vision, 2016. 510--526. Google Scholar

[231] Papineni K, Roukos S, Ward T, et al. Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2002. 311--318. Google Scholar

[232] Lin C, Och F J. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2004. 605--612. Google Scholar

[233] Vedantam R, Zitnick C L, Parikh D. Cider: consensus-based image description evaluation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015. 4566--4575. Google Scholar

[234] Denkowski M J, Lavie A. Meteor universal: language specific translation evaluation for any target language. In: Proceedings of the Workshop on Statistical Machine Translation, 2014. 376--380. Google Scholar

[235] Xu J, Yao T, Zhang Y, et al. Learning multimodal attention LSTM networks for video captioning. In: Proceedings of the ACM International Conference on Multimedia, 2017. 537--545. Google Scholar

[236] Yu H, Wang J, Huang Z, et al. Video paragraph captioning using hierarchical recurrent neural networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016. 4584--4593. Google Scholar

[237] Liu S, Ren Z, Yuan J. SibNet: sibling convolutional encoder for video captioning. IEEE Trans Pattern Anal Mach Intell, 2020. doi: 10.1109/TPAMI.2019.2940007 Google Scholar

[238] Ballas N, Yao L, Pal C, et al. Delving deeper into convolutional networks for learning video representations. In: Proceedings of International Conference on Learning Representations, 2016. 1--11. Google Scholar

[239] Chen Y, Wang S, Zhang W, et al. Less is more: picking informative frames for video captioning. In: Proceedings of European Conference on Computer Vision, 2018. 367--384. Google Scholar

[240] Gan Z, Gan C, He X, et al. Semantic compositional networks for visual captioning. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017. 1141--1150. Google Scholar

[241] Wang J, Wang W, Huang Y, et al. M3: multimodal memory modelling for video captioning. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018. 7512--7520. Google Scholar

[242] Krishna R, Hata K, Ren F, et al. Dense-captioning events in videos. In: Proceedings of IEEE International Conference on Computer Vision, 2017. 706--715. Google Scholar

[243] Zhang J, Peng Y. Object-aware aggregation with bidirectional temporal graph for video captioning. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019. 8327--8336. Google Scholar

[244] Zhang X, Gao K, Zhang Y, et al. Task-driven dynamic fusion: Reducing ambiguity in video description. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017. 3713--3721. Google Scholar

[245] Shetty R, Laaksonen J. Frame-and segment-level features and candidate pool evaluation for video caption generation. In: Proceedings of ACM Conference on Multimedia Conference, 2016. 1073--1076. Google Scholar

[246] Hori C, Hori T, Lee T, et al. Attention-based multimodal fusion for video description. In: Proceedings of the IEEE International Conference on Computer Vision, 2017. 4203--4212. Google Scholar

[247] Mun J, Yang L, Ren Z, et al. Streamlined dense video captioning. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019. 6588--6597. Google Scholar

[248] Li X. Features, indexing, and interaction in content-based image retrieval. Dissertation for Ph.D. Degree. Hefei: University of Science and Technology of China, 2002. Google Scholar
