SCIENTIA SINICA Informationis, Volume 47, Issue 8: 953 (2017) https://doi.org/10.1360/N112017-00125

Survey of evaluation methods for dialogue systems

More info
  • Received Apr 6, 2017
  • Accepted Jun 21, 2017
  • Published Jul 24, 2017


[1] Turing A M. Computing machinery and intelligence. Mind, 1950, 59: 433-460.

[2] Walker M A, Litman D J, Kamm C A, et al. PARADISE: a framework for evaluating spoken dialogue agents. In: Proceedings of the 8th Conference on European Chapter of the Association for Computational Linguistics, Madrid, 1997. 271-280.

[3] Rieser V, Lemon O. Learning and evaluation of dialogue strategies for new applications: empirical methods for optimization from small data sets. Computat Linguist, 2011, 37: 153-196.

[4] Larsen L B. Issues in the evaluation of spoken dialogue systems using objective and subjective measures. In: Proceedings of 2003 IEEE Workshop on Automatic Speech Recognition and Understanding, St Thomas, 2003. 209-214.

[5] Yang Z J, Levow G, Meng H. Predicting user satisfaction in spoken dialog system evaluation with collaborative filtering. IEEE J Select Topics Signal Proc, 2012, 6: 971-981.

[6] Asri L E, Laroche R, Pietquin O. Task completion transfer learning for reward inference. In: Proceedings of AAAI Workshop on Machine Learning for Interactive Systems, Quebec, 2014. 38-43.

[7] Ultes S, Minker W. Quality-adaptive spoken dialogue initiative selection and implications on reward modelling. In: Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Prague, 2015. 374-383.

[8] Su P H, Gašić M, Mrkšić N, et al. On-line active reward learning for policy optimisation in spoken dialogue systems. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, 2016. 2431-2441.

[9] Young S, Gašić M, Thomson B, et al. POMDP-based statistical spoken dialogue systems: a review. Proc IEEE, 2012, 101: 1160-1179.

[10] Hirschman L, Thompson H S. Overview of evaluation in speech and natural language processing. In: Survey of the State of the Art in Human Language Technology. New York: Cambridge University Press, 1997. 409-414.

[11] Watanabe T, Araki M, Doshita S. Evaluating dialogue strategies under communication errors using computer-to-computer simulation. IEICE Trans Inform Syst, 1998, E81-D: 1025-1033.

[12] Ai H, Weng F. User simulation as testing for spoken dialog systems. In: Proceedings of Annual Meeting of the Special Interest Group on Discourse and Dialogue, Columbus, 2008. 164-171.

[13] Schatzmann J. Statistical user and error modelling for spoken dialogue systems. Dissertation for Ph.D. Degree. Cambridge: University of Cambridge, 2008.

[14] Williams J. Applying POMDPs to dialog systems in the troubleshooting domain. In: Proceedings of the HLT/NAACL Workshop on Bridging the Gap: Academic and Industrial Research in Dialog Technology, New York, 2007. 1-8.

[15] Thomson B, Young S. Bayesian update of dialogue state: a POMDP framework for spoken dialogue systems. Comput Speech Language, 2010, 24: 562-588.

[16] Henderson J, Lemon O, Georgila K. Hybrid reinforcement supervised learning for dialogue policies from communicator data. In: Proceedings of the IJCAI Workshop on Knowledge and Reasoning in Practical Dialog Systems, Edinburgh, 2005. 68-75.

[17] Gašić M, Lefevre F, Jurčíček F, et al. Back-off action selection in summary space-based POMDP dialogue systems. In: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding, Merano, 2009. 456-461.

[18] Jurčíček F, Keizer S, Gašić M, et al. Real user evaluation of spoken dialogue systems using Amazon Mechanical Turk. In: Proceedings of Interspeech Conference, Florence, 2011. 3061-3064.

[19] McGraw I, Lee C, Hetherington L, et al. Collecting voices from the cloud. In: Proceedings of International Conference on Language Resources and Evaluation, Malta, 2010. 1576-1583.

[20] Gašić M, Jurčíček F, Thomson B, et al. On-line policy optimisation of spoken dialogue systems via live interaction with human subjects. In: Proceedings of IEEE Workshop on Automatic Speech Recognition & Understanding, Hawaii, 2011. 312-317.

[21] Black A, Burger S, Conkie A, et al. Spoken dialog challenge 2010: comparison of live and control test results. In: Proceedings of Annual Meeting of the Special Interest Group on Discourse and Dialogue, Portland, 2011. 2-7.

[22] Papineni K, Roukos S, Ward T, et al. BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Philadelphia, 2002. 311-318.

[23] Banerjee S, Lavie A. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, 2005.

[24] Lin C Y. ROUGE: a package for automatic evaluation of summaries. In: Proceedings of the Workshop on Text Summarization Branches Out, Barcelona, 2004. 25-26.

[25] Rus V, Lintean M. A comparison of greedy and optimal assessment of natural language student input using word-to-word similarity metrics. In: Proceedings of the 7th Workshop on Building Educational Applications Using NLP, Stroudsburg, 2012. 157-162.

[26] Wieting J, Bansal M, Gimpel K, et al. Towards universal paraphrastic sentence embeddings. arXiv: 1511.08198.

[27] Forgues G, Pineau J, Larcheveque J M, et al. Bootstrapping dialog systems with word embeddings. In: Proceedings of the 5th SIGdial Workshop on Discourse and Dialogue, Cambridge, 2004.

[28] Mikolov T, Sutskever I, Chen K, et al. Distributed representations of words and phrases and their compositionality. In: Proceedings of International Conference on Neural Information Processing Systems, Lake Tahoe, 2013. 3111-3119.

[29] Liu C W, Lowe R, Serban I V, et al. How NOT to evaluate your dialogue system: an empirical study of unsupervised evaluation metrics for dialogue response generation. arXiv: 1603.08023v2.

[30] Kannan A, Vinyals O. Adversarial evaluation of dialogue models. arXiv: 1701.08198v1.

[31] Lowe R, Noseworthy M, Serban I V, et al. Towards an automatic Turing test: learning to evaluate dialogue responses. 2017. In press.

[32] Lowe R, Serban I V, Noseworthy M, et al. On the evaluation of dialogue systems with next utterance classification. In: Proceedings of Annual Meeting of the Special Interest Group on Discourse and Dialogue, Los Angeles, 2016.

[33] Lowe R, Pow N, Serban I V, et al. The Ubuntu dialogue corpus: a large dataset for research in unstructured multi-turn dialogue systems. In: Proceedings of Annual Meeting of the Special Interest Group on Discourse and Dialogue, Prague, 2015.

[34] Ritter A, Cherry C, Dolan B. Unsupervised modeling of Twitter conversations. In: Proceedings of Annual Conference on North American Chapter of the Association for Computational Linguistics (NAACL), Los Angeles, 2010. 172-180.

[35] Banchs R E. Movie-DiC: a movie dialogue corpus for research and development. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, Cincinnati, 2012.