Please use this identifier to cite or link to this item: http://scholarbank.nus.edu.sg/handle/10635/14509
Title: Web page cleaning for web mining
Authors: YI LAN
Keywords: Web Mining, Web Page Cleaning, Web Page Structure, Web Page Presentation Style
Issue Date: 1-Feb-2005
Source: YI LAN (2005-02-01). Web page cleaning for web mining. ScholarBank@NUS Repository.
Abstract: This thesis focuses on the problem of Web page cleaning, i.e., the pre-processing of Web pages to automatically detect and eliminate noise for Web mining. We propose two methods, the SST-based method and the feature weighting method, to perform Web page cleaning automatically or semi-automatically. Both methods are based on the observation that, in a given Web site, noisy blocks of a Web page usually share common contents and/or presentation styles, while the main content blocks of the page are often diverse in their actual contents and presentation styles. The SST-based method builds a site style tree (SST) to capture the actual contents and the presentation styles of the Web pages in a given Web site. An information-based measure is introduced to determine which parts of the SST represent noise and which parts represent the main contents of the site. The SST is then employed to detect and eliminate noise in a Web page of the site by mapping the page to the SST. As an improvement on the SST-based method, the feature weighting method builds a compressed structure tree (CST) for a given Web site and also uses an information-based measure to weight features in the CST. The resulting features and their corresponding accumulated weights are used for Web mining tasks.
URI: http://scholarbank.nus.edu.sg/handle/10635/14509
Appears in Collections: Ph.D Theses (Open)
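
The observation stated in the abstract (noisy blocks repeat across the pages of a site, while main content blocks vary) can be illustrated with a small Python sketch. This is not the thesis's SST or CST algorithm: the region names, the normalized-entropy score in `diversity`, and the 0.5 threshold in `split_regions` are all assumptions chosen for illustration, standing in for the information-based measure that the abstract mentions without defining.

```python
import math
from collections import Counter


def diversity(values):
    """Normalized Shannon entropy of the values seen in one page region.

    Returns 0 when every page shows the same value and about 1 when
    every page shows a different one. (Illustrative measure, assumed.)
    """
    m = len(values)
    if m <= 1:
        return 1.0  # a single page gives no evidence of repetition
    counts = Counter(values)
    entropy = -sum((c / m) * math.log(c / m) for c in counts.values())
    return entropy / math.log(m)


def split_regions(pages, threshold=0.5):
    """Label each region 'noise' or 'content' (hypothetical threshold rule)."""
    regions = sorted({name for page in pages for name in page})
    labels = {}
    for name in regions:
        values = [page[name] for page in pages if name in page]
        labels[name] = "content" if diversity(values) >= threshold else "noise"
    return labels


# Toy site: each page is a mapping from a region name to its text.
pages = [
    {"nav": "Home | News | Contact", "body": "Article about entropy",
     "footer": "(c) Example Site"},
    {"nav": "Home | News | Contact", "body": "Review of a new phone",
     "footer": "(c) Example Site"},
    {"nav": "Home | News | Contact", "body": "Recipe for noodle soup",
     "footer": "(c) Example Site"},
]
labels = split_regions(pages)
print(labels)
```

In this toy site the `nav` and `footer` regions are identical on every page and are labeled as noise, while the `body` region differs on every page and is labeled as content. A tree-based method such as the one the abstract describes would apply a comparable idea to nodes of a tree built from the pages rather than to flat, pre-named regions.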

Files in This Item:
File: thesis_yilan.pdf
Size: 1.31 MB
Format: Adobe PDF
Access Settings: OPEN
Version: None



Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.