SCIENTIA SINICA Informationis, Volume 46 , Issue 8 : 969-981(2016) https://doi.org/10.1360/N112016-00072

Towards real world perception and interaction

More info
  • ReceivedMar 30, 2016
  • AcceptedMay 30, 2016


Funded by



[1] Silver D, Huang A, Maddison C J, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016, 529: 484-489 CrossRef Google Scholar

[2] Roberts L. Machine perception of three-dimensional solids. In: Optical and Electron-optical Information Processing. Cambridge: MIT Press, 1965. 159-197. Google Scholar

[3] Marr D. Vision: a Computational Investigation Into the Human Representation and Processing of Visual Information. Cambridge: MIT Press, 2010. Google Scholar

[4] Jain R C, Binford T O. Ignorance, myopia, and naiveté in computer vision systems. CVGIP: Image Und, 1991, 53: 112-117 CrossRef Google Scholar

[5] Brooks R A. Intelligence without reason. In: Proceedings of the 12th International Joint Conference on Artificial Intelligence, Sydney, 1991. 569-595. Google Scholar

[6] Zhang Z. Microsoft Kinect sensor and its effect. IEEE Multimed, 2012, 19: 4-10. Google Scholar

[7] Shotton J, Sharp T, Kipman A, et al. Real-time human pose recognition in parts from single depth images. Commun ACM, 2013, 56: 116-124 CrossRef Google Scholar

[8] Fankhauser P, Bloesch M, Rodriguez D, et al. Kinect v2 for mobile robot navigation: evaluation and modeling. In: Proceedings of the 17th International Conference on Advanced Robotics, Istanbul, 2015. 388-394. Google Scholar

[9] Han J, Shao L, Xu D, et al. Enhanced computer vision with Microsoft Kinect sensor: a review. IEEE Trans Cyber, 2013, 43: 1318-1334 CrossRef Google Scholar

[10] Leibe B, Schiele B. Analyzing appearance and contour based methods for object categorization. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Madison, 2003. 2: 409-415. Google Scholar

[11] Li F-F, Rob F, Pietro P. Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. Comput Vis Image Und, 2007, 106: 59-70 CrossRef Google Scholar

[12] Everingham M, van Gool L, Williams C K I, et al. The PASCAL visual object classes (VOC) challenge. Int J Comput Vision, 2010, 88, 303-338. Google Scholar

[13] Deng J, Dong W, Socher R, et al. ImageNet: a large-scale hierarchical image database. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Miami Beach, 2009. 248-255. Google Scholar

[14] Krishna R, Zhu Y, Groth O, et al. Visual genome: connecting language and vision using crowdsourced dense image annotations. arXiv:1602.07332. Google Scholar

[15] Thomee B, Elizalde B, Shamma D, et al. YFCC100M: the new data in multimedia research. Commun ACM, 2016, 59: 64-73. Google Scholar

[16] Viola P, Jones R. Rapid object detection using a boosted cascade of simple features. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Kauai, 2001. 2: 524-531. Google Scholar

[17] Fei-Fei L, Perona P. A Bayesian hierarchical model for learning natural scene categories. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, San Diego, 2005. 2: 524-531. Google Scholar

[18] Krizhevsky A, Sutskever I, Hinton G. ImageNet classification with deep convolutional neural networks. In: Proceedings of Advances in Neural Information Processing Systems 25, Lake Tahoe, 2012. 1097-1105. Google Scholar

[19] Lake B M, Salakhutdinov R, Tenenbaum J B. Human-level concept learning through probabilistic program induction. Science, 2015, 350: 1332-1338 CrossRef Google Scholar

[20] Chen J, Chen B. Architectural modeling from sparsely scanned range data. Int J Comput Vision, 2007, 78: 223-236. Google Scholar

[21] Lin H, Gao J, Zhou Y, et al. Semantic decomposition and reconstruction of residential scenes from LiDAR data. ACM Trans Graphics, 2013, 32: 1-10. Google Scholar

[22] Agarwala S, Furukawaa Y, Snavely N, et al. Building Rome in a day. Commun ACM, 2011, 54: 105-112 CrossRef Google Scholar

[23] Newcombe R A, Izadi S, Hilliges O, et al. KinectFusion: real-time dense surface mapping and tracking. In: Proceedings of the 10th IEEE International Symposium on Mixed and Augmented Reality, Basel, 2011. 127-136. Google Scholar

[24] Henry P, Krainin M, Herbst E, et al. RGB-D mapping: using depth cameras for dense 3D modeling of indoor environments. In: Proceedings of the International Symposium on Experimental Robotics, New Delhi and Agra, 2010. 22-25. Google Scholar

[25] Nan L, Xie K, Sharf A. A search-classify approach for cluttered indoor scene understanding. ACM Trans Graphics, 2012, 31: 1-10. Google Scholar

[26] Chen K. Lai Y-K, Wu Y-X, et al. Automatic semantic modeling of indoor scenes from low-quality RGB-D data using contextual information. ACM Trans Graphics, 2014, 33: 1-12. Google Scholar

[27] Zhang L, Vazquez C, Knorr S. 3D-TV content creation: automatic 2D-to-3D video conversion. IEEE Trans Broadcast, 2011, 57: 372-383 CrossRef Google Scholar

[28] Karsch K, Liu C, Kang S B. Depth transfer: depth extraction from video using non-parametric sampling. IEEE Trans Pattern Anal Mach Intell, 2014, 36: 2144-2158 CrossRef Google Scholar

[29] Song Y M, Xie Y, Malyarchuk Y, et al. Digital cameras with designs inspired by the arthropod eye. Nature, 2013, 497: 95-99 CrossRef Google Scholar

[30] Yokoya R, Nayar S K. Extended depth of field catadioptric imaging using focal sweep. In: Proceedings of the 15th IEEE International Conference on Computer Vision, Santiago, 2015. 3505-3513. Google Scholar

[31] Nayar S, Mitsunaga T. High dynamic range imaging: spatially varying pixel exposures. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Hilton Head, 2000. 472-479. Google Scholar

[32] Hinton G, Deng L, Yu D, et al. Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process Mag, 2012, 29: 82-97. Google Scholar

[33] Johnson A, Hebert M. Using spin images for efficient object recognition in cluttered 3D scenes. IEEE Trans Pattern Anal Mach Intell, 1999, 21: 433-449 CrossRef Google Scholar

[34] Bo L, Ren X, Fox D. Depth kernel descriptors for object recognition. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, San Francisco, 2011. 821-826. Google Scholar

[35] Xiao J, Owens A, Torralba A. SUN3D: a database of big spaces reconstructed using SfM and object labels. In: Proceedings of the 14th IEEE International Conference on Computer Vision, Sydney, 2013. 1625-1632. Google Scholar

[36] Jacob M G, Li Y-T, Akingba G A, et al. Collaboration with a robotic scrub nurse. Commun ACM, 2013, 56: 68-75. Google Scholar

[37] Chai X, Li G, Chen X, et al. VisualComm: a tool to support communication between deaf and hearing persons with the Kinect. In: Proceedings of the 15th International ACM SIGACCESS Conference on Computers and Accessibility, Bellevue, 2013. 76. Google Scholar

[38] Häuslschmid R, Menrad B, Butz A. Freehand vs. micro gestures in the car: driving performance and user experience. In: Proceedings of IEEE Symposium on 3D User Interfaces (3DUI), Arles, 2015. 159-160. Google Scholar

[39] Lampert C H, Nickisch H, Harmeling S. Attribute-based classification for zero-shot visual object categorization. IEEE Trans Pattern Anal Mach Intell, 2014, 36: 453-465 CrossRef Google Scholar

[40] Liang K, Chang H, Shan S, et al. A unified multiplicative framework for attribute learning. In: Proceedings of the 15th International Conference on Computer Vision, Santiago, 2015. 2506-2514. Google Scholar

[41] Malinowski M, Rohrbach M, Fritz M. Ask your neurons: a neural-based approach to answering questions about images. In: Proceedings of the 15th International Conference on Computer Vision, Santiago, 2015. 1-9. Google Scholar

[42] Liu H, Wang R, Shan S, et al. Deep supervised hashing for fast image retrieval. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 2016. Google Scholar

[43] Ross P. When will software have the right stuff? IEEE Spectrum, 2011, 48: 38-43. Google Scholar

[44] Kirby M, Sirovich L. Application of the Karhunen-Loeve procedure for the characterization of human faces. IEEE Trans Pattern Anal Mach Intell, 1990, 12: 103-108 CrossRef Google Scholar

[45] Schmidhuber J. Learning complex extended sequences using the principle of history compression. Neural Comput, 1992, 4: 234-242 CrossRef Google Scholar