Molweni: A Challenge Multiparty Dialogues-based Machine Reading Comprehension Dataset with Discourse Structure

Please use this identifier to cite or link to this item: https://doi.org/10.18653/v1/2020.coling-main.238

DC Field	Value
dc.title	Molweni: A Challenge Multiparty Dialogues-based Machine Reading Comprehension Dataset with Discourse Structure
dc.contributor.author	Li, Jiaqi
dc.contributor.author	Liu, Ming
dc.contributor.author	Kan, Min-Yen
dc.contributor.author	Zheng, Zihao
dc.contributor.author	Wang, Zekun
dc.contributor.author	Lei, Wenqiang
dc.contributor.author	Liu, Ting
dc.contributor.author	Qin, Bing
dc.date.accessioned	2021-07-22T06:45:00Z
dc.date.available	2021-07-22T06:45:00Z
dc.date.issued	2020
dc.identifier.citation	Li, Jiaqi, Liu, Ming, Kan, Min-Yen, Zheng, Zihao, Wang, Zekun, Lei, Wenqiang, Liu, Ting, Qin, Bing (2020). Molweni: A Challenge Multiparty Dialogues-based Machine Reading Comprehension Dataset with Discourse Structure. Proceedings of the 28th International Conference on Computational Linguistics abs/2004.05080. ScholarBank@NUS Repository. https://doi.org/10.18653/v1/2020.coling-main.238
dc.identifier.uri	https://scholarbank.nus.edu.sg/handle/10635/194749
dc.description.abstract	Research into the area of multiparty dialog has grown considerably over recent years. We present the Molweni dataset, a machine reading comprehension (MRC) dataset with discourse structure built over multiparty dialog. Molweni's source samples from the Ubuntu Chat Corpus, including 10,000 dialogs comprising 88,303 utterances. We annotate 30,066 questions on this corpus, including both answerable and unanswerable questions. Molweni also uniquely contributes discourse dependency annotations in a modified Segmented Discourse Representation Theory (SDRT; Asher et al., 2016) style for all of its multiparty dialogs, contributing large-scale (78,245 annotated discourse relations) data to bear on the task of multiparty dialog discourse parsing. Our experiments show that Molweni is a challenging dataset for current MRC models: BERT-wwm, a current, strong SQuAD 2.0 performer, achieves only 67.7% F1 on Molweni's questions, a 20+% significant drop as compared against its SQuAD 2.0 performance.
dc.publisher	International Committee on Computational Linguistics
dc.source	Elements
dc.subject	cs.CL
dc.type	Article
dc.date.updated	2021-07-22T04:12:42Z
dc.contributor.department	DEPARTMENT OF COMPUTER SCIENCE
dc.description.doi	10.18653/v1/2020.coling-main.238
dc.description.sourcetitle	Proceedings of the 28th International Conference on Computational Linguistics
dc.description.volume	abs/2004.05080
dc.published.state	Published
Appears in Collections:	Staff Publications Elements

Files in This Item:

File	Description	Size	Format	Access Settings	Version
2004.05080v3.pdf	Accepted version	555.19 kB	Adobe PDF	OPEN	Pre-print
