Please use this identifier to cite or link to this item:
https://scholarbank.nus.edu.sg/handle/10635/166668
Title: Deep AM-FM: Toolkit for Automatic Dialogue Evaluation
Authors: Chen Zhang, Luis Fernando D'Haro, Rafael E. Banchs, Thomas Friedrichs, Haizhou Li
Issue Date: 17-Apr-2020
Publisher: Springer
Citation: Chen Zhang, Luis Fernando D'Haro, Rafael E. Banchs, Thomas Friedrichs, Haizhou Li (2020-04-17). Deep AM-FM: Toolkit for Automatic Dialogue Evaluation. ScholarBank@NUS Repository.
Rights: CC0 1.0 Universal

Abstract: There have been many studies of human-machine dialogue systems. To evaluate them accurately and fairly, many resort to human grading of system outputs. Unfortunately, this is time-consuming and expensive. The study of AM-FM (Adequacy Metric - Fluency Metric) proposes an automatic evaluation metric that achieves good correlation with human judgements. The AM-FM framework measures the quality of dialogue generation along two dimensions with the help of gold references: (1) the semantic closeness of the generated response to the corresponding gold references; (2) the syntactic quality of the sentence construction. However, the original formulations of both the adequacy and fluency metrics face technical limitations. The latent semantic indexing (LSI) approach to AM modeling does not scale to large amounts of data, and its bag-of-words representation of sentences fails to capture contextual information. As for FM modeling, the n-gram language model implementation cannot capture long-term dependencies. Many deep learning approaches, such as the long short-term memory network (LSTM) or transformer-based architectures, address these issues well: they provide better context-aware sentence representations than the LSI approach and achieve much lower perplexity on benchmark datasets than the n-gram language model.

In this paper, we propose deep AM-FM, a DNN-based implementation of the framework, and demonstrate that it achieves promising improvements in both Pearson and Spearman correlation with human evaluation on the benchmark DSTC6 End-to-End Conversation Modeling task, as compared to its original implementation and other popular automatic metrics.

URI: https://scholarbank.nus.edu.sg/handle/10635/166668
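The two dimensions described in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the min/max probability ratio for fluency, and the linear interpolation weight `alpha` are assumptions for the sake of the example; the sentence embeddings and language-model probabilities are taken as given inputs (in deep AM-FM they would come from a neural encoder and a neural language model).

```python
import math

def adequacy_metric(emb_hyp, emb_ref):
    """Adequacy (AM): cosine similarity between the sentence embedding
    of the generated response and that of a gold reference."""
    dot = sum(h * r for h, r in zip(emb_hyp, emb_ref))
    norm = (math.sqrt(sum(h * h for h in emb_hyp))
            * math.sqrt(sum(r * r for r in emb_ref)))
    return dot / norm

def fluency_metric(prob_hyp, prob_ref):
    """Fluency (FM): ratio of length-normalized language-model
    probabilities of hypothesis and reference, in [0, 1].
    (One plausible formulation; the exact normalization is an assumption.)"""
    return min(prob_hyp, prob_ref) / max(prob_hyp, prob_ref)

def am_fm_score(am, fm, alpha=0.5):
    """Final score: linear interpolation of the two metrics;
    alpha=0.5 is an arbitrary illustrative weight."""
    return alpha * am + (1 - alpha) * fm

# Toy usage: identical embeddings give AM = 1.0; per-token
# probabilities 0.2 vs 0.25 give FM = 0.8; combined score = 0.9.
score = am_fm_score(adequacy_metric([1.0, 0.0], [1.0, 0.0]),
                    fluency_metric(0.2, 0.25))
```

Per-response scores like this one would then be correlated (Pearson/Spearman) against human judgements to evaluate the metric itself, as done on the DSTC6 task in the paper.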
Appears in Collections: Staff Publications, Students Publications, Elements
Files in This Item:
File | Description | Size | Format | Access Settings | Version
---|---|---|---|---|---
IWSDS_2020_paper_11.pdf | | 275.25 kB | Adobe PDF | OPEN | Published