GENERATION OF A HYBRID CLUSTERING ALGORITHM FOR BIG DATA

This paper proposes a hybrid algorithm for clustering big data that is based on Rank Similarity, calculated as the sum of Cosine Similarity and Gaussian Similarity. The proposed technique is compared with an existing technique that is based on Cosine Similarity only. The comparison uses four parameters: precision, recall, F-measure, and accuracy. The results are evaluated in Java using NetBeans 8.2.


Big Data
As stated by IBM, with pervasive handheld devices, machine-to-machine communication, and online/mobile social networks, 2.5 quintillion bytes of data have been created every day over the last two years. It has become difficult for users to capture, store, manage, analyse, share, and visualize such data with existing data processing tools. Because of this, the concept of big data has been proposed. The capability for data generation has never been so enormous and powerful since the development of information technology (IT) in the late 19th century. As another example, on October 4, 2012, the first presidential debate between President Obama and Governor Mitt Romney triggered more than 10 million tweets within two hours, and the discussion generated at specific moments revealed the public's interest in topics such as Medicare and vouchers.

However, the term 'big data' is still vague. According to Wikipedia, big data is a term for any collection of data sets so large and complex that it is difficult to process with traditional data processing applications. A widely accepted definition belongs to IDC: 'big data technology describes a new generation of technology and architecture, that aims to achieve high-speed capture, discovery and/or economic analysis to extract value from large amounts of data'. Exploiting large data sets of exceptional value, however, also increases the risk to security and privacy. For example, Amazon monitors users' shopping preferences. Facebook appears to capture all this information as well as our social relationships. Mobile operators know not only whom a person is talking to but also who is nearby. Big data promises valuable insights to those who analyse it, and the signs point to a further surge in the gathering, storage, and reuse of personal data. If the age of the Internet threatened security and privacy, the era of big data endangers them even more.
Before going further into what big data is, it is worth taking a moment to look at the following diagram by Hewlett-Packard:

Clustering
Clustering is the grouping of data into different sets, classes, or clusters. The data placed in one cluster are similar to the other data in that cluster and dissimilar to the data in other clusters. Dissimilarities can be calculated according to various attributes. Various distance measures describe the dissimilarity between data objects, and these dissimilarity values are then used to construct a dissimilarity matrix. Clustering is useful in various fields such as data mining, statistics, biology, and machine learning. Numerous clustering algorithms are discussed in the literature. Every algorithm has its own pros and cons, and each finds its use in different situations [1]. Clustering algorithms are typically grouped into the following categories: 1. Partitioning Methods. 2. Density-Based Methods. 3. Hierarchical Methods. 4. Grid-Based Methods.

5. Supervised and Unsupervised Learning Based Methods.

In this paper, two of these categories (Partitioning and Density-Based Methods) are exploited to perform the clustering.

Partitioning Methods: Suppose there is a database containing n objects or data tuples and the task is to divide these objects into k clusters, where k ≤ n. A partitioning method clusters them according to the dissimilarity between the data objects: similar objects are placed in the same group and dissimilar objects in different groups. A partitioning algorithm must meet two essential requirements: (1) no cluster may be empty, i.e., every cluster must contain at least one data object, and (2) no data object is shared among clusters. k-Means and k-Medoids are two well-known partitioning clustering methods [2].

Density-Based Methods: The partitioning methods discussed above divide the data into clusters that are mainly spherical in shape. Sometimes, however, objects need to be clustered into arbitrary shapes. In these situations, the notion of density is used to create clusters that need not be spherical. The idea behind density-based methods is to keep adding data objects to a given cluster as long as the density in the neighbourhood exceeds some threshold. The concept of neighbourhood similarity is also taken into account. The clusters resulting from density-based methods may have arbitrary shapes [3,4].
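The density idea can be illustrated with a minimal sketch. It assumes Euclidean distance, a neighbourhood radius `eps`, and a density threshold `min_pts`; these names and the DBSCAN-style expansion rule are illustrative assumptions, not the procedure used in this paper.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def neighbours(points, index, eps):
    """Indices of all points within distance eps of points[index] (itself included)."""
    return [j for j, q in enumerate(points) if euclidean(points[index], q) <= eps]

def grow_cluster(points, seed, eps, min_pts):
    """Grow one cluster from seed, expanding only through dense neighbourhoods."""
    cluster = set()
    frontier = [seed]
    while frontier:
        i = frontier.pop()
        if i in cluster:
            continue
        cluster.add(i)
        near = neighbours(points, i, eps)
        # Expand further only if the neighbourhood meets the density threshold.
        if len(near) >= min_pts:
            frontier.extend(j for j in near if j not in cluster)
    return cluster

# Hypothetical data: four points close together and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
dense_cluster = grow_cluster(points, seed=0, eps=1.5, min_pts=3)
isolated = grow_cluster(points, seed=4, eps=1.5, min_pts=3)
```

In this example the four nearby points form one cluster, while the isolated point stays alone because its neighbourhood never reaches the threshold. The cluster shape is determined by the chain of dense neighbourhoods rather than by the distance to a single centre.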

GENERATION OF CLUSTERS USING COSINE SIMILARITY
2.1 Term Frequency
Term frequency (TF) is a statistical text-analysis technique that is widely used in search engines and information retrieval systems. Assume that there is a collection of 500 documents and the task is to compute the similarity between two given documents (or between a document and a query). The similarity value is obtained through the following steps [5, 6]:

1. Document pre-processing
• Tokenization: A document is treated as a string (or bag of words) and partitioned into a list of tokens.
• Removing stop words/stemming: Stop words are frequently occurring, insignificant words. This step eliminates them.

2. Document representation
• Generate the index terms and represent each document as an N-dimensional vector in term space.

3. Computing term weights
• Compute the term frequency.
• Apply term frequency weighting.

After the three steps above, the similarity between two documents is measured. The cosine similarity is calculated as the cosine of the angle between the two document vectors:

$$ s(x, y) = \frac{x^{t} \cdot y}{\|x\| \, \|y\|} \tag{1} $$

where $x^{t}$ is the transpose of vector x, $\|x\|$ is the Euclidean norm of vector x, $\|y\|$ is the Euclidean norm of vector y, and s is the cosine of the angle between vectors x and y [7].
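The steps above and Equation (1) can be sketched as follows. This is a minimal illustration under stated assumptions: the stop-word list is a tiny hypothetical one, stemming is omitted, raw counts serve as the term weights, and the function names are not taken from the paper's implementation.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real systems use larger lists).
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def term_frequencies(text):
    """Tokenize, drop stop words, and count the remaining terms."""
    return Counter(t for t in tokenize(text) if t not in STOP_WORDS)

def cosine_similarity(tf_x, tf_y):
    """Equation (1): dot product divided by the product of the Euclidean norms."""
    dot = sum(tf_x[t] * tf_y[t] for t in tf_x.keys() & tf_y.keys())
    norm_x = math.sqrt(sum(v * v for v in tf_x.values()))
    norm_y = math.sqrt(sum(v * v for v in tf_y.values()))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_x * norm_y)

# Two hypothetical documents sharing three of their index terms.
doc_a = term_frequencies("The clustering of big data")
doc_b = term_frequencies("Clustering big data is hard")
score = cosine_similarity(doc_a, doc_b)
```

Here the two documents share the terms "clustering", "big", and "data", so the dot product is 3, the norms are √3 and 2, and the similarity is √3/2 ≈ 0.866. Storing each document as a sparse term-count map avoids materialising the full N-dimensional vector while giving the same value as Equation (1).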

Cluster Formation
Numerous clustering algorithms exist in the literature, but the centroid-based k-means algorithm is the most widely used because it is simple to implement and produces good results. It is a partitioning method that divides the dataset into a predefined number k of partitions, known as clusters, with minimum intra-cluster distance. The number of clusters k is given as input, and the algorithm arranges all the given objects into k partitions, each of which forms a separate cluster. In the proposed approach, the clusters are created using centroid selection. First, the centroids of the different clusters are evaluated. Further documents are then added by choosing, for each document, the cluster whose centroid is nearest to it. In every step, documents are placed in the clusters and the clusters take shape. An error function is also calculated at each step. If the new centroids give a lower error function value, they are kept and movement continues in the same direction. If the error value is higher than before, the direction of movement changes. Continuing this process until all the documents have been processed yields the clusters. In this paper, cluster formation using only cosine similarity as the measure is called the First Technique, and cluster formation using the Hybrid Clustering Algorithm is called the Second Technique.
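The centroid-based procedure can be sketched as a standard k-means loop with cosine similarity as the closeness measure. This is a minimal sketch under assumptions: documents are already dense term vectors, the initial centroids are supplied by the caller, and the error function is taken to be the sum of cosine distances. The direction-changing rule described above is not reproduced, and all names are illustrative.

```python
import math

def cosine(x, y):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

def assign(vectors, centroids):
    """Index of the most similar centroid for every vector."""
    return [max(range(len(centroids)), key=lambda c: cosine(v, centroids[c]))
            for v in vectors]

def update(vectors, labels, centroids):
    """Mean vector of each cluster; an empty cluster keeps its old centroid."""
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [v for v, label in zip(vectors, labels) if label == c]
        if members:
            new_centroids.append([sum(col) / len(members) for col in zip(*members)])
        else:
            new_centroids.append(list(old))
    return new_centroids

def error(vectors, labels, centroids):
    """Assumed error function: sum of cosine distances to the assigned centroid."""
    return sum(1.0 - cosine(v, centroids[label]) for v, label in zip(vectors, labels))

def k_means(vectors, centroids, max_iter=100):
    """Alternate assignment and centroid update until the labels stop changing."""
    labels = assign(vectors, centroids)
    for _ in range(max_iter):
        centroids = update(vectors, labels, centroids)
        new_labels = assign(vectors, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

# Hypothetical term vectors for four documents, with two initial centroids.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels, centroids = k_means(docs, centroids=[[1.0, 0.0], [0.0, 1.0]])
total_error = error(docs, labels, centroids)
```

With these vectors the first two documents end up in one cluster and the last two in the other, and the error value is close to zero because every document points in nearly the same direction as its centroid.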

Results Evaluated in First Technique
A total of 500 text files have been uploaded for processing.

GENERATION OF HYBRID ALGORITHM
The Hybrid Algorithm makes use of Rank Similarity, i.e., the sum of Cosine Similarity and Gaussian Similarity [8]. The concept of neighbours reachable from the main document is also exploited.
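A minimal sketch of Rank Similarity follows. The text defines it only as the sum of the two similarities, so the Gaussian form used here, exp(-||x - y||² / (2σ²)), and the bandwidth parameter `sigma` are assumptions. The exact definition in [8] may differ.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

def gaussian_similarity(x, y, sigma=1.0):
    """Assumed Gaussian (RBF) similarity: exp(-||x - y||^2 / (2 * sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))

def rank_similarity(x, y, sigma=1.0):
    """Rank Similarity as described in the text: cosine plus Gaussian similarity."""
    return cosine_similarity(x, y) + gaussian_similarity(x, y, sigma)

# Hypothetical term vectors: identical documents and orthogonal documents.
same = rank_similarity([1.0, 2.0], [1.0, 2.0])
orthogonal = rank_similarity([1.0, 0.0], [0.0, 1.0])
```

Under these assumptions, identical vectors score close to 2 (both terms reach their maximum of 1), while orthogonal unit vectors score exp(-1) ≈ 0.368 because the cosine term vanishes and only the distance-based term contributes. The Gaussian term thus adds sensitivity to vector magnitude, which cosine similarity alone ignores.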

Fig. 7 shows that precision is reduced in the Hybrid Algorithm, but the other parameters, viz., recall and F-measure, are improved, which ultimately results in a more accurate algorithm. The accuracy increases from 4.84 % to 6.07 %.
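For reference, the four evaluation parameters can be computed from true/false positive and negative counts using their standard definitions, as in the sketch below. The counts in the example are hypothetical and are not the results reported in this paper.

```python
def evaluation_metrics(tp, fp, fn, tn):
    """Precision, recall, F-measure, and accuracy from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }

# Hypothetical counts, chosen only to illustrate the formulas.
metrics = evaluation_metrics(tp=8, fp=2, fn=8, tn=82)
```

With these counts, precision is 0.8, recall is 0.5, the F-measure is about 0.615, and accuracy is 0.9. Because the F-measure is the harmonic mean of precision and recall, a method can lose some precision and still improve its F-measure if recall rises enough.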