A NOVEL TWO-PHASE PAGE FEATURE AND KTH KEYPHRASE FINGERPRINT BASED DUPLICATE DETECTION TECHNIQUE

Ashlesha Gupta, Ashutosh Dixit, A.K. Sharma

Abstract


The World Wide Web is a huge repository of network-accessible information, including text, images, audio, video and metadata. With the rapid increase in both the information resources available on the WWW and the number of Internet users, it is becoming difficult to manage and access the desired information on the web. Therefore, the majority of users rely on information retrieval tools such as search engines to find the desired information on the WWW. Web search engines work by storing information about many web pages, which they retrieve from the WWW itself. Many of the pages stored in a search engine's repository are duplicates or near duplicates of other pages. These duplicate and near-duplicate web pages require more storage space, which increases the cost of serving results and also frustrates users. To help search engines provide high-quality, distinct, redundancy-free results, duplicate and near-duplicate detection algorithms are used. The proposed duplicate detection approach detects near-duplicate web pages efficiently and quickly, thereby improving the search effectiveness and storage efficiency of the search engine.

Keywords


Duplicate page; Near-duplicate page; Filtering; Fingerprint; Page features






DOI: https://doi.org/10.26483/ijarcs.v9i1.5352





Copyright (c) 2018 International Journal of Advanced Research in Computer Science