A NOVEL TWO-PHASE PAGE FEATURE AND KTH KEYPHRASE FINGERPRINT BASED DUPLICATE DETECTION TECHNIQUE

Ashlesha Gupta; Ashutosh Dixit; A.K. Sharma

doi:10.26483/ijarcs.v9i1.5352

Published: Feb 23, 2018

Keywords:

Duplicate page, Near-duplicate page, Filtering, Finger-print, Page-features

Ashlesha Gupta

YMCA University of Science and Technology, Faridabad

Ashutosh Dixit

A.K. Sharma

Abstract

The World Wide Web is a huge repository of network-accessible information, including text, images, audio, video and metadata. With the rapid increase in both the information resources available via the WWW and the number of Internet users, it is becoming difficult to manage and access the desired information on the web. Therefore, the majority of users rely on information retrieval tools such as search engines to find the desired information on the WWW. Web search engines work by storing information about many web pages, which they retrieve from the WWW itself. Many of the pages stored in a search engine's repository are duplicates or near duplicates of other pages. These duplicate and near-duplicate web pages require more storage space, which increases the cost of serving results and frustrates users. Duplicate and near-duplicate detection algorithms are used to help search engines provide high-quality, distinct results that are free of redundancy. The proposed duplicate detection approach detects near-duplicate web pages efficiently and quickly, thereby improving the search effectiveness and storage efficiency of the search engine.

Issue

Vol. 9 No. 1 (2018): January-February 2018

Section

Articles

COPYRIGHT

Submission of a manuscript implies that the work described has not been published before; that it is not under consideration for publication elsewhere; and that, if and when the manuscript is accepted for publication, the authors agree to automatic transfer of the copyright to the publisher.

Authors who publish with this journal agree to the following terms:

Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this journal.
Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this journal.
Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.
The journal allows the author(s) to retain publishing rights without restrictions.
The journal allows the author(s) to hold the copyright without restrictions.
