Please use this identifier to cite or link to this item: https://scholarbank.nus.edu.sg/handle/10635/166668
dc.title: Deep AM-FM: Toolkit for Automatic Dialogue Evaluation
dc.contributor.author: Chen Zhang
dc.contributor.author: Luis Fernando D'Haro
dc.contributor.author: Rafael E. Banchs
dc.contributor.author: Thomas Friedrichs
dc.contributor.author: Haizhou Li
dc.date.accessioned: 2020-04-17T02:26:36Z
dc.date.available: 2020-04-17T02:26:36Z
dc.date.issued: 2020-04-17
dc.identifier.citation: Chen Zhang, Luis Fernando D'Haro, Rafael E. Banchs, Thomas Friedrichs, Haizhou Li (2020-04-17). Deep AM-FM: Toolkit for Automatic Dialogue Evaluation. ScholarBank@NUS Repository.
dc.identifier.uri: https://scholarbank.nus.edu.sg/handle/10635/166668
dc.description.abstract: There have been many studies on human-machine dialogue systems. To evaluate them accurately and fairly, many researchers resort to human grading of system outputs. Unfortunately, this is time-consuming and expensive. The study of AM-FM (Adequacy Metric - Fluency Metric) suggests an automatic evaluation metric that achieves good performance in terms of correlation with human judgements. The AM-FM framework intends to measure the quality of dialogue generation along two dimensions with the help of gold references: (1) the semantic closeness of the generated response to the corresponding gold references; (2) the syntactic quality of the sentence construction. However, the original formulations of both the adequacy and fluency metrics face some technical limitations. The latent semantic indexing (LSI) approach to AM modeling is not scalable to large amounts of data, and its bag-of-words representation of sentences fails to capture contextual information. As for FM modeling, the n-gram language model implementation is not able to capture long-term dependencies. Many deep learning approaches, such as the long short-term memory network (LSTM) or transformer-based architectures, address these issues well by providing better context-aware sentence representations than the LSI approach and by achieving much lower perplexity on benchmark datasets than the n-gram language model. In this paper, we propose Deep AM-FM, a DNN-based implementation of the framework, and demonstrate that it achieves promising improvements in both Pearson and Spearman correlation with human evaluation on the benchmark DSTC6 End-to-End Conversation Modeling task, as compared to its original implementation and other popular automatic metrics.
dc.publisher: Springer
dc.rights: CC0 1.0 Universal
dc.rights.uri: http://creativecommons.org/publicdomain/zero/1.0/
dc.type: Book Chapter
dc.contributor.department: ELECTRICAL AND COMPUTER ENGINEERING
dc.published.state: Published
dc.grant.id: AISG-GC-2019-002
Appears in Collections: Staff Publications, Students Publications, Elements
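The abstract above describes how AM-FM scores a response along two dimensions: an adequacy score from the semantic closeness between the generated response and a gold reference, and a fluency score from language-model perplexity. The following is a minimal Python sketch of how such scores could combine; the function names, the perplexity-ratio normalisation, and the fixed interpolation weight are illustrative assumptions, not the toolkit's actual implementation (which derives embeddings from a deep encoder and perplexities from a neural language model).

```python
import math

def adequacy_metric(response_vec, reference_vec):
    """AM sketch: cosine similarity between sentence embeddings.
    In Deep AM-FM the embeddings come from a deep encoder; here the
    vectors are supplied directly as plain lists (hypothetical inputs)."""
    dot = sum(a * b for a, b in zip(response_vec, reference_vec))
    norm_resp = math.sqrt(sum(a * a for a in response_vec))
    norm_ref = math.sqrt(sum(b * b for b in reference_vec))
    if norm_resp == 0.0 or norm_ref == 0.0:
        return 0.0
    return dot / (norm_resp * norm_ref)

def fluency_metric(response_ppl, reference_ppl):
    """FM sketch: relative fluency from language-model perplexities.
    Lower perplexity means more fluent; the min/max ratio (an assumed
    normalisation) keeps the score in (0, 1]."""
    return min(response_ppl, reference_ppl) / max(response_ppl, reference_ppl)

def am_fm_score(am, fm, lam=0.5):
    """Interpolate adequacy and fluency; lam is an assumed weight."""
    return lam * am + (1.0 - lam) * fm
```

For example, a response whose embedding matches the reference exactly but whose perplexity is twice the reference's would receive `am_fm_score(1.0, 0.5) == 0.75` under the equal-weight assumption.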

Files in This Item:
File: IWSDS_2020_paper_11.pdf
Size: 275.25 kB
Format: Adobe PDF
Access Settings: OPEN
Version: Published



This item is licensed under a Creative Commons License.