Please use this identifier to cite or link to this item: https://doi.org/10.1109/TPAMI.2007.1158
Title: Script and language identification in noisy and degraded document images
Authors: Shijian, L. 
Tan, C.L. 
Keywords: Association rules
Classification
Clustering
Document analysis
Language identification
Script identification
Shape
Issue Date: 2008
Citation: Shijian, L., Tan, C.L. (2008). Script and language identification in noisy and degraded document images. IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (1) : 14-24. ScholarBank@NUS Repository. https://doi.org/10.1109/TPAMI.2007.1158
Abstract: This paper reports an identification technique that detects scripts and languages of noisy and degraded document images. In the proposed technique, scripts and languages are identified through document vectorization, which converts each document image into a document vector that characterizes the shape and frequency of the contained character or word images. Document images are vectorized by using vertical component cuts and character extremum points, which are both tolerant to the variation in text fonts and styles, noise, and various types of document degradation. For each script or language under study, a script or language template is first constructed through a training process. Scripts and languages of document images are then determined according to the distances between converted document vectors and the pre-constructed script and language templates. Experimental results show that the proposed technique is accurate, easy to extend, and tolerant to noise and various types of document degradation. © 2008 IEEE.
Source Title: IEEE Transactions on Pattern Analysis and Machine Intelligence
URI: http://scholarbank.nus.edu.sg/handle/10635/39073
ISSN: 0162-8828
DOI: 10.1109/TPAMI.2007.1158
Appears in Collections: Staff Publications
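
Illustrative sketch (not from the paper): the abstract describes building one template per script or language from training data, then assigning a document to the template nearest its document vector. The Python below is a minimal sketch of that matching step only. It assumes that a template is the mean of the training vectors and that the distance is Euclidean; the abstract specifies neither. The vectorization itself (vertical component cuts, character extremum points) is not implemented, and the function names, labels, and toy vectors are all hypothetical.

```python
from math import sqrt


def build_templates(training_vectors):
    """Average each label's training vectors into one template (assumed rule)."""
    templates = {}
    for label, vectors in training_vectors.items():
        dim = len(vectors[0])
        templates[label] = [
            sum(v[i] for v in vectors) / len(vectors) for i in range(dim)
        ]
    return templates


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def identify(doc_vector, templates):
    """Return the label whose template is closest to the document vector."""
    return min(templates, key=lambda label: euclidean(doc_vector, templates[label]))


# Toy, made-up document vectors standing in for real vectorized images.
training = {
    "script_a": [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1]],
    "script_b": [[0.1, 0.2, 0.7], [0.2, 0.1, 0.7]],
}
templates = build_templates(training)
print(identify([0.75, 0.15, 0.10], templates))
```

Adding a new script or language in this sketch only requires adding another entry to the training dictionary, which is one way to read the abstract's claim that the technique is easy to extend.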

Files in This Item:
There are no files associated with this item.


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.