Please use this identifier to cite or link to this item: http://scholarbank.nus.edu.sg/handle/10635/41416
Title: Towards understanding the functions of web element
Authors: Yin, X.; Lee, W.S.
Issue Date: 2005
Source: Yin, X., Lee, W.S. (2005). Towards understanding the functions of web element. Lecture Notes in Computer Science 3411: 313-324. ScholarBank@NUS Repository.
Abstract: A web page is a collection of basic elements, and each element plays a different role in the page. For example, an image element can be part of the main content, an advertisement, or the site's banner. This paper describes ongoing work that uses a machine learning approach to classify each element in a web page into six functional categories: Content (C), Related Link (R), Navigation (N), Advertisement (A), Form (F) and Other (O). This allows only certain categories of content in a web page to be extracted and delivered to a mobile device to fit the user's specific needs, or to facilitate web information processes such as web mining or mobile search. We manually labeled 18,864 elements from 150 websites. For each element we extracted both local features (such as the text length, URL, and tag name) and global features (such as the text match with other elements) to construct a feature vector. We trained the J48 decision tree learning algorithm on a training set of 10,650 elements; it achieved 82% accuracy under stratified cross-validation and an average F value of 0.78 across the six categories. Testing on 3,043 elements from pages not included in the training set gives an accuracy of 58%. Although this is not satisfactory overall, the F value for the Content category reaches 0.795, indicating that the method could be useful for less demanding applications. We are working on improving the results in order to make automatic functional classification of web elements feasible and to provide new opportunities to push the state of the art in the mobile internet and mobile search. © Springer-Verlag Berlin Heidelberg 2005.
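Illustrative sketch (not from the paper): the abstract mentions a feature vector built from local and global features and a per-category F value, but does not give the exact feature set or formulas. The Python below is a minimal sketch under stated assumptions. The `Element` type, the three local features, the exact-text-match count used as the global feature, and the reading of "F value" as the harmonic mean of precision and recall are all assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# The six functional categories named in the abstract.
CATEGORIES = ["C", "R", "N", "A", "F", "O"]


@dataclass
class Element:
    """Hypothetical stand-in for one parsed web element."""
    tag: str
    text: str
    url: Optional[str] = None


def local_features(el: Element) -> Dict[str, object]:
    # Local features depend only on the element itself (assumed subset).
    return {
        "tag": el.tag,
        "text_length": len(el.text),
        "has_url": el.url is not None,
    }


def global_features(el: Element, page: List[Element]) -> Dict[str, object]:
    # Global feature (assumed form): how many *other* elements on the page
    # carry the same normalized, non-empty text as this element.
    key = el.text.strip().lower()
    matches = sum(
        1 for o in page
        if o is not el and key and o.text.strip().lower() == key
    )
    return {"text_match_count": matches}


def feature_vector(el: Element, page: List[Element]) -> Dict[str, object]:
    # Concatenate local and global features into one record per element.
    return {**local_features(el), **global_features(el, page)}


def f_value(true_labels: List[str], predicted: List[str], category: str) -> float:
    # F value for one category, assumed here to be 2PR / (P + R).
    pairs = list(zip(true_labels, predicted))
    tp = sum(1 for t, p in pairs if t == category and p == category)
    fp = sum(1 for t, p in pairs if t != category and p == category)
    fn = sum(1 for t, p in pairs if t == category and p != category)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    page = [
        Element("a", "Home", "/"),
        Element("a", "Home", "/"),
        Element("p", "Some long article text"),
    ]
    print(feature_vector(page[0], page))
    print(f_value(["C", "C", "N", "A"], ["C", "N", "N", "A"], "C"))
```

Records of this shape could then be passed to any decision tree learner; the paper itself reports using J48.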
Source Title: Lecture Notes in Computer Science
URI: http://scholarbank.nus.edu.sg/handle/10635/41416
ISSN: 0302-9743
Appears in Collections: Staff Publications
