Please use this identifier to cite or link to this item: https://doi.org/10.1145/3308558.3313598
DC Field | Value
dc.title: Neural Multimodal Belief Tracker with Adaptive Attention for Dialogue Systems
dc.contributor.author: Zheng Zhang
dc.contributor.author: Lizi Liao
dc.contributor.author: Minlie Huang
dc.contributor.author: Xiaoyan Zhu
dc.contributor.author: Tat-Seng Chua
dc.date.accessioned: 2020-04-28T04:18:47Z
dc.date.available: 2020-04-28T04:18:47Z
dc.date.issued: 2019-05-13
dc.identifier.citation: Zheng Zhang, Lizi Liao, Minlie Huang, Xiaoyan Zhu, Tat-Seng Chua (2019-05-13). Neural Multimodal Belief Tracker with Adaptive Attention for Dialogue Systems. WWW 2019: 2401-2412. ScholarBank@NUS Repository. https://doi.org/10.1145/3308558.3313598
dc.identifier.isbn: 9781450366748
dc.identifier.uri: https://scholarbank.nus.edu.sg/handle/10635/167317
dc.description.abstract: Multimodal dialogue systems are attracting increasing attention as they provide a more natural and informative way for human-computer interaction. As one of their core components, the belief tracker estimates the user's goal at each step of the dialogue and provides a direct way to validate the system's ability to understand the dialogue. However, existing studies on belief trackers are largely limited to the textual modality, which cannot be easily extended to capture the rich semantics in multimodal systems such as those with product images. For example, in the fashion domain, the visual appearance of clothes plays a crucial role in understanding the user's intention. In this case, existing belief trackers may fail to generate accurate belief states for a multimodal dialogue system. In this paper, we present the first neural multimodal belief tracker (NMBT) to demonstrate how multimodal evidence can facilitate semantic understanding and dialogue state tracking. Given the multimodal inputs, the model applies a textual encoder to represent textual utterances while giving special consideration to the semantics revealed in the visual modality. It learns concept-level fashion semantics by delving deep into image sub-regions and integrating concept probabilities via multiple instance learning. Then, in each turn, an adaptive attention mechanism learns to automatically emphasize different evidence sources of both visual and textual modalities for more accurate dialogue state prediction. We perform extensive evaluation on a multi-turn task-oriented dialogue dataset in the fashion domain, and the results show that our method achieves superior performance compared to a wide range of baselines. © 2019 IW3C2 (International World Wide Web Conference Committee), published under Creative Commons CC-BY 4.0 License.
dc.publisher: Association for Computing Machinery, Inc
dc.rights: Attribution 4.0 International
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/
dc.type: Conference Paper
dc.contributor.department: DEPT OF COMPUTER SCIENCE
dc.description.doi: 10.1145/3308558.3313598
dc.description.sourcetitle: WWW 2019
dc.description.page: 2401-2412
dc.grant.id: R-252-300-002-490
dc.grant.fundingagency: Infocomm Media Development Authority
dc.grant.fundingagency: National Research Foundation
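
Note on the method described in the abstract above: the adaptive attention mechanism re-weights textual and visual evidence at each dialogue turn before predicting the belief state. The following is a minimal illustrative sketch of that general idea, assuming a simple two-source soft-attention fusion in PyTorch; the class name, dimensions, and slot-value head are hypothetical and are not taken from the paper's implementation.

    # Illustrative sketch only (not the authors' code): fuse a turn-level text
    # encoding and an aggregated visual/concept encoding with learned attention
    # weights, then predict a distribution over candidate slot values.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AdaptiveAttentionFusion(nn.Module):
        def __init__(self, text_dim=256, visual_dim=512, hidden_dim=128, num_slot_values=20):
            super().__init__()
            # Project both evidence sources into a shared space.
            self.text_proj = nn.Linear(text_dim, hidden_dim)
            self.visual_proj = nn.Linear(visual_dim, hidden_dim)
            # Scores how informative each source is in the current turn.
            self.attn_score = nn.Linear(hidden_dim, 1)
            # Predicts a distribution over candidate slot values from the fused evidence.
            self.state_head = nn.Linear(hidden_dim, num_slot_values)

        def forward(self, text_vec, visual_vec):
            # text_vec: (batch, text_dim) turn-level utterance encoding
            # visual_vec: (batch, visual_dim) aggregated image/concept encoding
            sources = torch.stack(
                [torch.tanh(self.text_proj(text_vec)),
                 torch.tanh(self.visual_proj(visual_vec))],
                dim=1)                                        # (batch, 2, hidden_dim)
            weights = F.softmax(self.attn_score(sources), dim=1)  # (batch, 2, 1)
            fused = (weights * sources).sum(dim=1)            # (batch, hidden_dim)
            return F.log_softmax(self.state_head(fused), dim=-1), weights

    # Usage with random tensors standing in for real encoder outputs.
    model = AdaptiveAttentionFusion()
    log_probs, weights = model(torch.randn(4, 256), torch.randn(4, 512))
    print(log_probs.shape, weights.squeeze(-1))  # torch.Size([4, 20]) and per-source weights

The softmax weights make the fusion step inspectable: for a turn dominated by a product image, the visual source should receive the larger weight, and vice versa for purely textual turns.
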
Appears in Collections: Elements, Staff Publications

Files in This Item:
File: Neural Multimodal Belief Tracker with Adaptive Attention for Dialogue Systems.pdf | Size: 4.03 MB | Format: Adobe PDF | Access Settings: OPEN

SCOPUS™ Citations: 18 (checked on Aug 7, 2022)
WEB OF SCIENCE™ Citations: 7 (checked on Oct 5, 2021)
Page view(s): 202 (checked on Aug 4, 2022)
Download(s): 1 (checked on Aug 4, 2022)

This item is licensed under a Creative Commons License.