Please use this identifier to cite or link to this item:
https://scholarbank.nus.edu.sg/handle/10635/166668
Title: Deep AM-FM: Toolkit for Automatic Dialogue Evaluation
Authors: Chen Zhang, Luis Fernando D'Haro, Rafael E. Banchs, Thomas Friedrichs, Haizhou Li
Issue Date: 17-Apr-2020
Publisher: Springer
Citation: Chen Zhang, Luis Fernando D'Haro, Rafael E. Banchs, Thomas Friedrichs, Haizhou Li (2020-04-17). Deep AM-FM: Toolkit for Automatic Dialogue Evaluation. ScholarBank@NUS Repository.
Rights: CC0 1.0 Universal

Abstract: There have been many studies of human-machine dialogue systems. To evaluate them accurately and fairly, many resort to human grading of system outputs. Unfortunately, this is time-consuming and expensive. The study of AM-FM (Adequacy Metric - Fluency Metric) proposes an automatic evaluation metric that achieves good correlation with human judgements. The AM-FM framework measures the quality of dialogue generation along two dimensions with the help of gold references: (1) the semantic closeness of the generated response to the corresponding gold references; (2) the syntactic quality of the sentence construction. However, the original formulations of both the adequacy and fluency metrics face technical limitations. The latent semantic indexing (LSI) approach to AM modeling does not scale to large amounts of data, and its bag-of-words representation of sentences fails to capture contextual information. As for FM modeling, the n-gram language model implementation cannot capture long-term dependencies. Many deep learning approaches, such as the long short-term memory network (LSTM) or transformer-based architectures, address these issues well: they provide better context-aware sentence representations than the LSI approach and achieve much lower perplexity on benchmark datasets than the n-gram language model.

In this paper, we propose deep AM-FM, a DNN-based implementation of the framework, and demonstrate that it achieves promising improvements in both Pearson and Spearman correlation with human evaluation on the benchmark DSTC6 End-to-End Conversation Modeling task, as compared to its original implementation and other popular automatic metrics.

URI: https://scholarbank.nus.edu.sg/handle/10635/166668
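The two dimensions described in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the min/max probability ratio for fluency, and the linear interpolation weight `alpha` are assumptions for the sake of the example; the sentence embeddings and language-model probabilities are taken as given inputs (in deep AM-FM they would come from a neural encoder and a neural language model).

```python
import math

def adequacy_metric(emb_hyp, emb_ref):
    """Adequacy (AM): cosine similarity between the sentence embedding
    of the generated response and that of a gold reference."""
    dot = sum(h * r for h, r in zip(emb_hyp, emb_ref))
    norm = (math.sqrt(sum(h * h for h in emb_hyp))
            * math.sqrt(sum(r * r for r in emb_ref)))
    return dot / norm

def fluency_metric(prob_hyp, prob_ref):
    """Fluency (FM): ratio of length-normalized language-model
    probabilities of hypothesis and reference, in [0, 1].
    (One plausible formulation; the exact normalization is an assumption.)"""
    return min(prob_hyp, prob_ref) / max(prob_hyp, prob_ref)

def am_fm_score(am, fm, alpha=0.5):
    """Final score: linear interpolation of the two metrics;
    alpha=0.5 is an arbitrary illustrative weight."""
    return alpha * am + (1 - alpha) * fm

# Toy usage: identical embeddings give AM = 1.0; per-token
# probabilities 0.2 vs 0.25 give FM = 0.8; combined score = 0.9.
score = am_fm_score(adequacy_metric([1.0, 0.0], [1.0, 0.0]),
                    fluency_metric(0.2, 0.25))
```

Per-response scores like this one would then be correlated (Pearson/Spearman) against human judgements to evaluate the metric itself, as done on the DSTC6 task in the paper.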
Appears in Collections: Staff Publications, Students Publications, Elements
Files in This Item:
File | Description | Size | Format | Access Settings | Version
---|---|---|---|---|---
IWSDS_2020_paper_11.pdf | | 275.25 kB | Adobe PDF | OPEN | Published