Enhancing Template Extraction accuracy of Heterogenous Web Documents

Mr.S.Sathees Babu

Abstract


Many websites contain large sets of pages generated by filling common templates with content. The extraneous terms in these templates degrade the accuracy and performance of web applications. Template detection techniques have therefore received considerable attention recently as a way to improve the performance of web applications such as search engines, clustering, and classification, and detection techniques are now commonly used to avoid the duplication that templates introduce. In this paper, we present techniques for automatically cropping clusters based on MDL cost that can be used to extract search result records from dynamically generated web documents. As a result, no additional template extraction process is needed after clustering. Experimental results show that the proposed approach is feasible and effective for improving extraction accuracy.

 

Keywords: Minimum Description Length (MDL), template extraction, MinHash, Max algorithm, Dice algorithm, clustering
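
The abstract does not describe how MinHash or the Dice measure are applied, so the following is only a generic sketch of the two techniques named in the keywords, not the paper's method. It assumes each page is reduced to a set of token shingles; the function names (`shingles`, `minhash_signature`, `estimate_jaccard`, `dice`), the shingle size, and the choice of hash function are all illustrative assumptions.

```python
import hashlib


def shingles(tokens, k=2):
    """Return the set of k-token shingles (as tuples) of a token list."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def _hash(item, seed):
    """Deterministic 64-bit hash of an item under a given seed (illustrative choice)."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=128):
    """MinHash signature: the minimum hash value of the set under each seed.

    Assumes `items` is a non-empty set.
    """
    return [min(_hash(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def dice(a, b):
    """Dice coefficient of two non-empty sets: 2|A ∩ B| / (|A| + |B|)."""
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical token streams for two pages sharing a template and one unrelated page.
page_a = shingles("html body div nav ul li a div content price".split())
page_b = shingles("html body div nav ul li a div content title".split())
page_c = shingles("rss channel item link guid pubdate".split())

sig_a, sig_b, sig_c = (minhash_signature(p) for p in (page_a, page_b, page_c))
print(estimate_jaccard(sig_a, sig_b))  # high for pages with a shared template
print(estimate_jaccard(sig_a, sig_c))  # near zero for unrelated pages
print(dice(page_a, page_b))
```

Comparing fixed-length signatures instead of full shingle sets keeps pairwise comparison cheap when many documents must be clustered; how such similarities would feed into an MDL-based cost is not stated in the abstract and is left out here.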




DOI: https://doi.org/10.26483/ijarcs.v3i4.1265





Copyright (c) 2016 International Journal of Advanced Research in Computer Science