Ashlesha Gupta, Ashutosh Dixit, A.K. Sharma


The World Wide Web is a huge repository of network-accessible information, including text, images, audio, video and metadata. With the rapid increase in both the information resources available via the WWW and the number of Internet users, it is becoming difficult to manage and access the desired information on the web. Therefore, the majority of users rely on information retrieval tools such as search engines to find the desired information on the WWW. Web search engines work by storing information about many web pages, which they retrieve from the WWW itself. Many of the pages stored in a search engine's repository are duplicates or near duplicates of other pages. These duplicate and near-duplicate web pages require more storage space, which increases the cost of serving results, and they also frustrate users. To help search engines provide high-quality, redundancy-free, distinct results, duplicate and near-duplicate detection algorithms are used. The proposed duplicate detection approach detects near-duplicate web pages efficiently and quickly, thereby improving the search effectiveness and storage efficiency of the search engine.


Duplicate page; Near-duplicate page; Filtering; Finger-print; Page-features
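
The abstract does not spell out the proposed approach, so the following is only a generic illustration of fingerprint-based near-duplicate detection, not the authors' method. It is a minimal sketch that assumes word-level shingling, a SHA-1-derived 64-bit hash per shingle, and a Jaccard similarity threshold; the function names, the shingle size `k`, and the threshold of 0.8 are all hypothetical choices made for the example.

```python
import hashlib
import re


def shingles(text, k=3):
    """Return the set of k-word shingles in text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Short pages yield a single shingle, or none if the page has no words.
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def fingerprint(text, k=3):
    """Hash each shingle to a 64-bit integer; the set of hashes is the page fingerprint."""
    return {
        int.from_bytes(hashlib.sha1(s.encode("utf-8")).digest()[:8], "big")
        for s in shingles(text, k)
    }


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets are treated as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(page_a, page_b, threshold=0.8, k=3):
    """Flag two pages as near duplicates when their fingerprints overlap enough.

    The threshold is an assumed value for illustration, not one taken from the paper.
    """
    return jaccard(fingerprint(page_a, k), fingerprint(page_b, k)) >= threshold
```

In such a scheme a crawler could compare the fingerprint of each newly fetched page against those already stored and skip pages flagged as near duplicates, which is the storage and result-quality benefit the abstract describes.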

DOI: https://doi.org/10.26483/ijarcs.v9i1.5352


