Please use this identifier to cite or link to this item: https://doi.org/10.18653/v1/2020.emnlp-main.455
DC Field / Value
dc.title: Mind your inflections! Improving NLP for non-standard Englishes with base-inflection encoding
dc.contributor.author: Tan, S
dc.contributor.author: Joty, S
dc.contributor.author: Varshney, LR
dc.contributor.author: Kan, MY
dc.date.accessioned: 2022-07-30T07:52:30Z
dc.date.available: 2022-07-30T07:52:30Z
dc.date.issued: 2020-01-01
dc.identifier.citation: Tan, S, Joty, S, Varshney, LR, Kan, MY (2020-01-01). Mind your inflections! Improving NLP for non-standard Englishes with base-inflection encoding. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP): 5647-5663. ScholarBank@NUS Repository. https://doi.org/10.18653/v1/2020.emnlp-main.455
dc.identifier.isbn: 9781952148606
dc.identifier.uri: https://scholarbank.nus.edu.sg/handle/10635/229531
dc.description.abstract: Inflectional variation is a common feature of World Englishes such as Colloquial Singapore English and African American Vernacular English. Although comprehension by human readers is usually unimpaired by non-standard inflections, current NLP systems are not yet robust. We propose Base-Inflection Encoding (BITE), a method to tokenize English text by reducing inflected words to their base forms before reinjecting the grammatical information as special symbols. Fine-tuning pretrained NLP models for downstream tasks using our encoding defends against inflectional adversaries while maintaining performance on clean data. Models using BITE generalize better to dialects with non-standard inflections without explicit training, and translation models converge faster when trained with BITE. Finally, we show that our encoding improves the vocabulary efficiency of popular data-driven subword tokenizers. Since there has been no prior work on quantitatively evaluating vocabulary efficiency, we propose metrics to do so.
dc.publisher: Association for Computational Linguistics
dc.source: Elements
dc.type: Conference Paper
dc.date.updated: 2022-07-19T07:44:18Z
dc.contributor.department: DEPARTMENT OF COMPUTER SCIENCE
dc.description.doi: 10.18653/v1/2020.emnlp-main.455
dc.description.sourcetitle: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
dc.description.page: 5647-5663
dc.published.state: Published
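
The abstract above describes Base-Inflection Encoding (BITE) only at a high level: inflected words are reduced to their base forms and the grammatical information is reinjected as special symbols. For illustration only, here is a minimal sketch of that idea; it is not the authors' released implementation, and the use of NLTK, Penn Treebank tags as the inflection symbols, and the bracketed symbol format are assumptions made for this demonstration.

# Illustrative sketch of base-inflection encoding (not the paper's released code).
# Assumes NLTK with the 'punkt', 'averaged_perceptron_tagger', and 'wordnet' data installed.
import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def to_wordnet_pos(ptb_tag):
    # Map Penn Treebank tags to the coarse POS classes WordNet's lemmatizer expects.
    if ptb_tag.startswith("V"):
        return "v"
    if ptb_tag.startswith("N"):
        return "n"
    if ptb_tag.startswith("J"):
        return "a"
    if ptb_tag.startswith("R"):
        return "r"
    return None

def bite_encode(sentence):
    """Replace each inflected word with its base form, then reinject the
    grammatical information as a special symbol (here, the POS tag)."""
    encoded = []
    for word, tag in nltk.pos_tag(nltk.word_tokenize(sentence)):
        wn_pos = to_wordnet_pos(tag)
        base = lemmatizer.lemmatize(word.lower(), wn_pos) if wn_pos else word.lower()
        encoded.append(base)
        if base != word.lower():        # the word carried an inflection
            encoded.append(f"[{tag}]")  # special symbol restoring that information
    return encoded

print(bite_encode("She speaks two languages"))
# ['she', 'speak', '[VBZ]', 'two', 'language', '[NNS]']

Under this sketch, a downstream subword tokenizer only ever sees base forms plus a small closed set of inflection symbols, which is the intuition behind the vocabulary-efficiency and robustness claims in the abstract.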
Appears in Collections: Staff Publications; Elements

Files in This Item:
File: 2020.emnlp-main.455.pdf
Size: 500.72 kB
Format: Adobe PDF
Access Settings: OPEN
Version: Published
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.