Please use this identifier to cite or link to this item: https://scholarbank.nus.edu.sg/handle/10635/166668
dc.title: Deep AM-FM: Toolkit for Automatic Dialogue Evaluation
dc.contributor.author: Chen Zhang
dc.contributor.author: Luis Fernando D'Haro
dc.contributor.author: Rafael E. Banchs
dc.contributor.author: Thomas Friedrichs
dc.contributor.author: Haizhou Li
dc.date.accessioned: 2020-04-17T02:26:36Z
dc.date.available: 2020-04-17T02:26:36Z
dc.date.issued: 2020-04-17
dc.identifier.citation: Chen Zhang, Luis Fernando D'Haro, Rafael E. Banchs, Thomas Friedrichs, Haizhou Li (2020-04-17). Deep AM-FM: Toolkit for Automatic Dialogue Evaluation. ScholarBank@NUS Repository.
dc.identifier.uri: https://scholarbank.nus.edu.sg/handle/10635/166668
dc.description.abstract: There have been many studies on human-machine dialogue systems. To evaluate them accurately and fairly, many researchers resort to human grading of system outputs. Unfortunately, this is time-consuming and expensive. The study of AM-FM (Adequacy Metric - Fluency Metric) suggests an automatic evaluation metric that achieves good performance in terms of correlation with human judgements. The AM-FM framework intends to measure the quality of dialogue generation along two dimensions with the help of gold references: (1) the semantic closeness of the generated response to the corresponding gold references; (2) the syntactic quality of the sentence construction. However, the original formulations of both the adequacy and fluency metrics face some technical limitations. The latent semantic indexing (LSI) approach to AM modeling is not scalable to large amounts of data, and its bag-of-words representation of sentences fails to capture contextual information. As for FM modeling, the n-gram language model implementation is not able to capture long-term dependencies. Many deep learning approaches, such as the long short-term memory network (LSTM) or transformer-based architectures, address these issues well by providing better context-aware sentence representations than the LSI approach and by achieving much lower perplexity on benchmark datasets than the n-gram language model. In this paper, we propose Deep AM-FM, a DNN-based implementation of the framework, and demonstrate that it achieves promising improvements in both Pearson and Spearman correlation with human evaluation on the benchmark DSTC6 End-to-End Conversation Modeling task, as compared to its original implementation and other popular automatic metrics.
dc.publisher: Springer
dc.rights: CC0 1.0 Universal
dc.rights.uri: http://creativecommons.org/publicdomain/zero/1.0/
dc.type: Book Chapter
dc.contributor.department: ELECTRICAL AND COMPUTER ENGINEERING
dc.published.state: Published
dc.grant.id: AISG-GC-2019-002
Appears in Collections: Staff Publications, Students Publications, Elements
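The abstract above describes how AM-FM scores a response along two dimensions: an adequacy score from the semantic closeness between the generated response and a gold reference, and a fluency score from language-model perplexity. The following is a minimal Python sketch of how such scores could combine; the function names, the perplexity-ratio normalisation, and the fixed interpolation weight are illustrative assumptions, not the toolkit's actual implementation (which derives embeddings from a deep encoder and perplexities from a neural language model).

```python
import math

def adequacy_metric(response_vec, reference_vec):
    """AM sketch: cosine similarity between sentence embeddings.
    In Deep AM-FM the embeddings come from a deep encoder; here the
    vectors are supplied directly as plain lists (hypothetical inputs)."""
    dot = sum(a * b for a, b in zip(response_vec, reference_vec))
    norm_resp = math.sqrt(sum(a * a for a in response_vec))
    norm_ref = math.sqrt(sum(b * b for b in reference_vec))
    if norm_resp == 0.0 or norm_ref == 0.0:
        return 0.0
    return dot / (norm_resp * norm_ref)

def fluency_metric(response_ppl, reference_ppl):
    """FM sketch: relative fluency from language-model perplexities.
    Lower perplexity means more fluent; the min/max ratio (an assumed
    normalisation) keeps the score in (0, 1]."""
    return min(response_ppl, reference_ppl) / max(response_ppl, reference_ppl)

def am_fm_score(am, fm, lam=0.5):
    """Interpolate adequacy and fluency; lam is an assumed weight."""
    return lam * am + (1.0 - lam) * fm
```

For example, a response whose embedding matches the reference exactly but whose perplexity is twice the reference's would receive `am_fm_score(1.0, 0.5) == 0.75` under the equal-weight assumption.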

Files in This Item:
File: IWSDS_2020_paper_11.pdf
Size: 275.25 kB
Format: Adobe PDF
Access Settings: OPEN
Version: Published



This item is licensed under a Creative Commons License.