Please use this identifier to cite or link to this item: https://doi.org/10.1145/3397271.3401107
DC Field: Value
dc.title: Enhancing Text Classification via Discovering Additional Semantic Clues from Logograms
dc.contributor.author: Chen Qian
dc.contributor.author: FENG FULI
dc.contributor.author: Lijie Wen
dc.contributor.author: Li Lin
dc.contributor.author: CHUA TAT SENG
dc.date.accessioned: 2020-11-13T05:47:37Z
dc.date.available: 2020-11-13T05:47:37Z
dc.date.issued: 2020-07-25
dc.identifier.citation: Chen Qian, FENG FULI, Lijie Wen, Li Lin, CHUA TAT SENG (2020-07-25). Enhancing Text Classification via Discovering Additional Semantic Clues from Logograms. SIGIR 2020. ScholarBank@NUS Repository. https://doi.org/10.1145/3397271.3401107
dc.identifier.uri: https://scholarbank.nus.edu.sg/handle/10635/183428
dc.description.abstract: Text classification in low-resource languages (e.g., Thai) is of great practical value for some information retrieval applications (e.g., sentiment-analysis-based restaurant recommendation). Because such languages lack a large-scale corpus for learning comprehensive text representations, bilingual text classification, which borrows linguistic knowledge from a rich-resource language, has become a promising solution. Despite their success, bilingual methods largely ignore another source of semantic information: the writing system. Noting that most low-resource languages are phonographic, we argue that a logographic language (e.g., Chinese) can provide helpful information for improving text classification in some phonographic languages, since a logographic character (i.e., a logogram) can represent a sememe or a whole concept, not only a phoneme or a sound. In this paper, using both a phonographic labeled corpus and its machine-translated logographic counterpart, we devise a framework that explores the central theme of utilizing logograms as a “semantic detection assistant”. Specifically, from a logographic labeled corpus, we first devise a statistical-significance-based module to pick out informative text pieces. To represent them and further reduce the effects of translation errors, our approach is equipped with Gaussian embeddings whose covariances serve as reliable signals of translation errors. For a test document, all seeds’ Gaussian representations are used to convolve the document and produce a logographic embedding, which is then fused with its phonographic embedding for the final prediction. Extensive experiments validate the effectiveness of our approach, and further investigations show its generalizability and robustness.
dc.language.iso: en
dc.publisher: SIGIR 2020
dc.rights: Attribution 4.0 International
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/
dc.subject: Bilingual Text Classification
dc.subject: Writing System
dc.subject: Logogram
dc.subject: Machine Translation
dc.subject: Gaussian Embedding
dc.type: Conference Paper
dc.contributor.department: COMPUTATIONAL SCIENCE
dc.description.doi: 10.1145/3397271.3401107
dc.description.sourcetitle: SIGIR 2020
dc.published.state: Published
dc.grant.id: R-252-300-002-490
dc.grant.fundingagency: IMDA
dc.grant.fundingagency: National Research Foundation
Appears in Collections: Staff Publications
Files in This Item:
File: Enhancing Text Classification via Discovering Additional Semantic Clues from Logograms.pdf
Size: 2.42 MB
Format: Adobe PDF
Access Settings: OPEN
Version: None

This item is licensed under a Creative Commons License.