Enhancing Template Extraction Accuracy of Heterogeneous Web Documents

Mr. S. Sathees Babu

Abstract

Countless websites contain large sets of pages generated by filling common templates with content. Because templates introduce extraneous terms, they degrade the accuracy and performance of web applications. Template detection techniques have therefore received considerable attention as a way to improve applications such as search engines, clustering, and classification, and they are now also used to avoid duplicated template content. In this paper, we present techniques for automatically cropping clusters based on Minimum Description Length (MDL) cost, which can be used to extract search result records from dynamically generated web documents. As a result, no additional template extraction process is needed after clustering. Experimental results show that the proposed approach is feasible and effective for improving extraction accuracy.
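
The keywords name MinHash, which is commonly used to estimate the Jaccard similarity of two sets without comparing them element by element. The sketch below is only an illustration of that general idea, not the paper's implementation: the function names, the use of seeded BLAKE2b hashes, and the example token sets (standing in for the paths or terms of two web pages) are all assumptions made here.

```python
import hashlib


def _hash(seed: int, token: str) -> int:
    """Seeded 64-bit hash of a token (illustrative choice of hash function)."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_hashes=256):
    """Return a MinHash signature: the minimum hash value per seeded hash function."""
    unique = set(tokens)
    if not unique:
        raise ValueError("cannot build a signature for an empty set")
    return [min(_hash(seed, t) for t in unique) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity, used here only to check the estimate."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# Hypothetical token sets for two pages that share part of a template.
page_a = ["html/body/div", "html/body/div/ul", "html/body/h1", "html/body/p"]
page_b = ["html/body/div", "html/body/div/ul", "html/body/table", "html/body/span"]

sig_a = minhash_signature(page_a)
sig_b = minhash_signature(page_b)
print("estimated:", estimated_jaccard(sig_a, sig_b))
print("exact:    ", exact_jaccard(page_a, page_b))
```

With more hash functions the estimate tends to move closer to the exact value, at the cost of longer signatures.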

Keywords: Minimum Description Length (MDL), template extraction, MinHash, Max Algorithm, dice algorithm, clustering
