Please use this identifier to cite or link to this item: https://scholarbank.nus.edu.sg/handle/10635/166668
Title: Deep AM-FM: Toolkit for Automatic Dialogue Evaluation
Authors: Chen Zhang 
Luis Fernando D'Haro
Rafael E. Banchs
Thomas Friedrichs
Haizhou Li 
Issue Date: 17-Apr-2020
Publisher: Springer
Citation: Chen Zhang, Luis Fernando D'Haro, Rafael E. Banchs, Thomas Friedrichs, Haizhou Li (2020-04-17). Deep AM-FM: Toolkit for Automatic Dialogue Evaluation. ScholarBank@NUS Repository.
Rights: CC0 1.0 Universal
Abstract: There have been many studies on human-machine dialogue systems. To evaluate them accurately and fairly, many resort to human grading of system outputs. Unfortunately, this is time-consuming and expensive. The study of AM-FM (Adequacy Metric - Fluency Metric) suggests an automatic evaluation metric that achieves good performance in terms of correlation with human judgements. The AM-FM framework intends to measure the quality of dialogue generation along two dimensions with the help of gold references: (1) the semantic closeness of the generated response to the corresponding gold references; (2) the syntactic quality of the sentence construction. However, the original formulations of both the adequacy and fluency metrics face some technical limitations. The latent semantic indexing (LSI) approach to AM modeling is not scalable to large amounts of data. The bag-of-words representation of sentences fails to capture contextual information. As for FM modeling, the n-gram language model implementation is not able to capture long-term dependencies. Many deep learning approaches, such as the long short-term memory network (LSTM) or transformer-based architectures, are able to address these issues well by providing better context-aware sentence representations than the LSI approach and achieving much lower perplexity on benchmarking datasets as compared to the n-gram language model. In this paper, we propose deep AM-FM, a DNN-based implementation of the framework, and demonstrate that it achieves promising improvements in both Pearson and Spearman correlation w.r.t. human evaluation on the benchmark DSTC6 End-to-End Conversation Modeling task as compared to its original implementation and other popular automatic metrics.
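Illustrative sketch (not part of the repository record or of the toolkit): the abstract describes scoring a response by its semantic closeness to a gold reference and judging a metric by its Pearson and Spearman correlation with human ratings. The Python below shows one way these two ideas could be computed with the standard library only. Cosine similarity is used here as an assumed stand-in for "semantic closeness"; the function names, the toy vectors, and the toy ratings are all hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors.

    Assumption: sentence embeddings are compared with cosine similarity;
    the abstract only says "semantic closeness".
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


# Toy, made-up inputs for illustration only (not results from the paper).
hypothesis_vecs = [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.1, 0.1, 0.8]]
reference_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
human_ratings = [4.5, 3.8, 1.5]

metric_scores = [
    cosine_similarity(h, r) for h, r in zip(hypothesis_vecs, reference_vecs)
]
print("Pearson: ", pearson(metric_scores, human_ratings))
print("Spearman:", spearman(metric_scores, human_ratings))
```

In practice a library routine such as those in SciPy would normally replace the hand-written correlation functions; they are spelled out here only to keep the sketch self-contained.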
URI: https://scholarbank.nus.edu.sg/handle/10635/166668
Appears in Collections: Staff Publications
Students Publications
Elements

Files in This Item:
File: IWSDS_2020_paper_11.pdf
Size: 275.25 kB
Format: Adobe PDF
Access Settings: OPEN
Version: Published



This item is licensed under a Creative Commons License.