
SCIENTIA SINICA Informationis, Volume 48, Issue 4: 433-448 (2018) https://doi.org/10.1360/N112017-00211

Intelligence methods of multi-modal information fusion in human-computer interaction

  • Received: Oct 30, 2017
  • Accepted: Mar 1, 2018
  • Published: Apr 13, 2018

Abstract


Funded by

National Key R&D Program of China (2017YFB1002804)

National Natural Science Foundation of China (61332017, 61425017)



  • Table 1   Accuracy of users' dialogue intention a)

                                   Correct ASR    Correct ASR and intentions    Inaccurate ASR with correct intentions
    Number of correct feedback     511            443                           62
    Accuracy                       0.819          0.867                         0.549

    a)
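
    The text of footnote a) was not preserved. A plausible reading of the figures, stated here as an assumption rather than the authors' definition: each column's accuracy is the number of correct feedback responses divided by the number of utterances falling under that column's condition. For example, the second column is consistent with evaluation on the subset of 511 correct-ASR utterances:

    \[
      \mathrm{Accuracy}_{\text{correct ASR and intentions}} \approx \frac{443}{511} \approx 0.867 .
    \]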