Enhancing Text Classification via Discovering Additional Semantic Clues from Logograms

Please use this identifier to cite or link to this item: https://doi.org/10.1145/3397271.3401107

DC Field	Value
dc.title	Enhancing Text Classification via Discovering Additional Semantic Clues from Logograms
dc.contributor.author	Chen Qian
dc.contributor.author	FENG FULI
dc.contributor.author	Lijie Wen
dc.contributor.author	Li Lin
dc.contributor.author	CHUA TAT SENG
dc.date.accessioned	2020-11-13T05:47:37Z
dc.date.available	2020-11-13T05:47:37Z
dc.date.issued	2020-07-25
dc.identifier.citation	Chen Qian, FENG FULI, Lijie Wen, Li Lin, CHUA TAT SENG (2020-07-25). Enhancing Text Classification via Discovering Additional Semantic Clues from Logograms. SIGIR 2020. ScholarBank@NUS Repository. https://doi.org/10.1145/3397271.3401107
dc.identifier.uri	https://scholarbank.nus.edu.sg/handle/10635/183428
dc.description.abstract	Text classification in low-resource languages (e.g., Thai) is of great practical value for some information retrieval applications (e.g., sentiment-analysis-based restaurant recommendation). Because large-scale corpora for learning comprehensive text representations are lacking, bilingual text classification, which borrows linguistic knowledge from a rich-resource language, has become a promising solution. Despite their success, bilingual methods largely ignore another source of semantic information: the writing system. Noting that most low-resource languages are phonographic, we argue that a logographic language (e.g., Chinese) can provide helpful information for improving text classification in some phonographic languages, since a logographic character (i.e., a logogram) can represent a sememe or a whole concept, not only a phoneme or a sound. In this paper, using both a phonographic labeled corpus and its machine-translated logographic counterpart, we devise a framework to explore the central theme of utilizing logograms as a “semantic detection assistant”. Specifically, from the logographic labeled corpus, we first devise a statistical-significance-based module to pick out informative text pieces, which serve as seeds. To represent them and further reduce the effects of translation errors, our approach uses Gaussian embeddings whose covariances serve as reliable signals of translation errors. For a test document, the Gaussian representations of all seeds are convolved over the document to produce a logographic embedding, which is then fused with the document’s phonographic embedding for the final prediction. Extensive experiments validate the effectiveness of our approach, and further investigations show its generalizability and robustness.
dc.language.iso	en
dc.publisher	SIGIR 2020
dc.rights	Attribution 4.0 International
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/
dc.subject	Bilingual Text Classification
dc.subject	Writing System
dc.subject	Logogram
dc.subject	Machine Translation
dc.subject	Gaussian Embedding
dc.type	Conference Paper
dc.contributor.department	COMPUTATIONAL SCIENCE
dc.description.doi	10.1145/3397271.3401107
dc.description.sourcetitle	SIGIR 2020
dc.published.state	Published
dc.grant.id	R-252-300-002-490
dc.grant.fundingagency	IMDA
dc.grant.fundingagency	National Research Foundation
Appears in Collections:	Staff Publications Elements

Files in This Item:

File	Description	Size	Format	Access Settings	Version
Enhancing Text Classification via Discovering Additional Semantic Clues from Logograms.pdf		2.42 MB	Adobe PDF	OPEN	None

This item is licensed under a Creative Commons License
