Document Clustering Using Cosine Similarity

Ranjith Kumar N S
Keerthi K P
Prekshitha N
Prema S
Naveen Chandra Gowda

Abstract

Clustering, or cluster analysis, is the process of grouping objects so that the objects in a group (cluster) are more similar to each other than to the objects in other groups. Clustering is an unsupervised machine learning technique: only the input data is supplied to the system (unlike supervised learning, where a set of sample input and output pairs is provided), and the system recognizes patterns in that data and produces the grouping automatically, so the process is fully automated. Document clustering, the subject of this work, organizes text files into clusters of files with similar content. High-precision clustering algorithms such as K-means play an important role in data storage, data manipulation and information retrieval systems. Search engines such as Google, Yahoo and Bing use document clustering, in addition to high-end processors and servers, to retrieve information in response to search queries. The most commonly used clustering technique is K-means, which divides the objects into k clusters of similar objects. The present work focuses on document clustering using cosine similarity, with pre-processing carried out by Apache Lucene, an off-the-shelf Java library. The text of each document is broken down into strings, and the extracted strings are fed to Apache Lucene, which pre-processes the data, counts the number of occurrences of each word and returns the output as JSON objects. The cosine similarity is then calculated from these indexed words. The final result identifies the documents that are similar to each other, those that are exactly alike (copies) and those that are unique (outliers). Applications of document clustering include mining useful data from large datasets, web page clustering, search engines and anti-plagiarism checkers.
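
The pipeline described in the abstract (word counts per document, pairwise cosine similarity, then a split into copies, similar documents and outliers) can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: it is written in Python rather than Java, a simple regular-expression tokenizer stands in for the Apache Lucene pre-processing, and the function names and the two thresholds (`similar_threshold`, `copy_threshold`) are assumptions chosen for the example.

```python
import math
import re
from collections import Counter
from itertools import combinations


def term_frequencies(text):
    """Lowercase the text, split it into word tokens, and count each token.

    Stand-in for the Lucene pre-processing step (assumption for this sketch).
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(count * tf_b[term] for term, count in tf_a.items())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def classify_documents(docs, similar_threshold=0.5, copy_threshold=0.999):
    """Group documents into copies, similar pairs, and outliers.

    docs maps a document name to its text. Both thresholds are
    hypothetical values picked for illustration. Because word order is
    ignored, a "copy" here means a pair with (near-)identical term
    proportions, not necessarily byte-identical files.
    """
    vectors = {name: term_frequencies(text) for name, text in docs.items()}
    copies, similar, matched = [], [], set()
    for a, b in combinations(sorted(vectors), 2):
        score = cosine_similarity(vectors[a], vectors[b])
        if score >= copy_threshold:
            copies.append((a, b))
        elif score >= similar_threshold:
            similar.append((a, b))
        else:
            continue
        matched.update((a, b))
    # A document that matched no other document is reported as an outlier.
    outliers = [name for name in sorted(vectors) if name not in matched]
    return {"copies": copies, "similar": similar, "outliers": outliers}
```

Calling `classify_documents` with a dictionary of file names and file contents returns the three groups named in the abstract. The thresholds would need tuning on real data, since raw term counts favor documents that share many common words.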
