A Review on K-Mode Clustering Algorithm

The main purpose of data mining is to extract useful information from large amounts of data. As one of the most important tasks in data mining, clustering is the process of grouping data objects by their attributes and features such that the objects in one group are more similar to each other than to objects in other groups. It is a form of unsupervised learning, which means that the way the data objects should be grouped is not known in advance. The algorithms used for clustering include the k-means, k-medoid, k-nearest neighbour and k-mode algorithms. The K-Mode Algorithm is a well-known extension of the K-Means Algorithm for clustering datasets with categorical attributes and is known for its simplicity and speed. It uses the ‘Simple Matching Dissimilarity’ measure instead of Euclidean distance and the ‘Mode’ of each cluster instead of the ‘Mean’. This paper reviews the K-Mode Algorithm.


INTRODUCTION
Data mining may be defined as the task of analysing data from different dimensions and summarizing it into useful information. The process consists of extracting, transforming and loading transaction data onto the data warehouse system; storing and managing the data in a multidimensional database system; providing data access to business analysts and information technology professionals; analysing the data with application software; and presenting the data in a useful format, such as a graph or table [2]. The datasets to be mined may contain millions of objects described by tens, hundreds or even thousands of attributes or variables of various types. The data can be stored in one or more operational databases, a data warehouse or a flat file. The major components of data mining technology, such as statistics, artificial intelligence and machine learning, have been under development in research areas [1]. Data mining operations and algorithms are required to deal with different types of attributes. Sophisticated data analysis tools are used along with visualization techniques to segment the data, after which the probability of future events is evaluated [2]. Data mining involves anomaly detection, association rule learning, classification, regression, summarization and clustering. The data is mined using one of two learning approaches, i.e. supervised learning or unsupervised learning [5].
A. Supervised learning: In this approach, the data includes both the input and the desired result. The correct results are known and are given to the model as input during the learning procedure, which makes learning fast and accurate. Neural networks, multilayer perceptrons and decision trees are supervised models.
B. Unsupervised learning: The desired result is not provided to the model during the learning procedure. This method can be used to cluster the input data into classes on the basis of their statistical properties only. Such models include various types of clustering, k-means, distances and normalization, and self-organizing maps [3].

CLUSTERING USING K-MODE ALGORITHM
Clustering is one of the fundamental tools available for understanding the nature of a dataset. It is an unsupervised learning technique used to place data elements into related groups without advance knowledge of the group definitions [4]. It divides a large dataset into groups, or clusters, according to similarity of properties. From a practical perspective, it plays an outstanding role in data mining applications such as information retrieval and text mining, spatial database applications, Web analysis, marketing, medical diagnostics and many others [1]. Clustering algorithms fall into five categories: hierarchical, partition-based, density-based, grid-based and model-based algorithms. The grid-based method maps all the objects into a number of square cells, known as grids. It has a fast processing time that depends on the size of the grid instead of the size of the data. Examples are STING and CLIQUE.
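The grid-based idea described above, mapping objects to square cells and then working with the cells rather than the individual objects, can be illustrated with a minimal sketch. The function name `grid_cell_counts`, the `cell_size` parameter and the density threshold of two points per cell are assumptions made for illustration; the sketch is not an implementation of STING or CLIQUE.

```python
from collections import Counter

def grid_cell_counts(points, cell_size):
    """Map 2-D points to square grid cells and count the points in each cell."""
    counts = Counter()
    for x, y in points:
        # The cell index is the integer part of each coordinate divided by the cell size.
        cell = (int(x // cell_size), int(y // cell_size))
        counts[cell] += 1
    return counts

# Hypothetical sample data: two small groups and one isolated point.
points = [(0.2, 0.3), (0.4, 0.1), (2.5, 2.7), (2.9, 2.2), (5.1, 0.4)]
counts = grid_cell_counts(points, cell_size=1.0)

# Cells holding at least two points are treated as dense (assumed threshold).
dense_cells = {cell for cell, n in counts.items() if n >= 2}
print(dense_cells)
```

Once the points are binned, later processing touches only the occupied cells, which is why the cost depends on the grid rather than on the number of objects.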

Model Based Clustering
In this method, the data are fitted to a given model for each cluster. Clusters may be located by constructing a density function that reflects the spatial distribution of the data points.

Partitioning Based Clustering
Partitioning is a popular approach to clustering that relocates objects by moving them from one cluster to another, starting from an initial partitioning. The number of clusters for this technique must be predefined. The algorithms used in this approach include the K-Means Algorithm, K-Medoid Algorithm, K-Nearest Neighbour Algorithm etc. [4]. The K-Means Algorithm is a partitioning-based clustering algorithm that groups similar data according to their closeness to each other based on the Euclidean distance [5]. It partitions the objects into a number of clusters such that each object belongs to the cluster with the nearest mean. The method produces exactly k different clusters with the greatest possible separation; the best number of clusters k is not known a priori and must be computed from the data [6]. The K-Mode Algorithm is a partitioning-based clustering algorithm that extends the K-Means Algorithm. It uses the simple matching dissimilarity function instead of the Euclidean distance. Modes are used to represent centroids, and a frequency-based method is used to find the centroids in each iteration of the algorithm [7].
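For contrast with the K-Mode Algorithm, the two core operations of K-Means on numeric data, assigning each object to the nearest mean by Euclidean distance and recomputing the means, can be sketched as follows. The function names and the sample points are assumptions made for illustration, and the sketch assumes that no cluster becomes empty.

```python
import math

def nearest_mean(point, means):
    """Return the index of the mean closest to `point` in Euclidean distance."""
    distances = [math.dist(point, m) for m in means]
    return distances.index(min(distances))

def update_means(points, labels, k):
    """Recompute each mean as the coordinate-wise average of its members.

    Assumes every cluster has at least one member.
    """
    means = []
    for c in range(k):
        members = [p for p, label in zip(points, labels) if label == c]
        means.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return means

# Hypothetical numeric data with two initial means.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
means = [(1.0, 1.0), (9.0, 9.0)]

labels = [nearest_mean(p, means) for p in points]
new_means = update_means(points, labels, k=2)
print(labels, new_means)
```

Both operations rely on arithmetic over attribute values, which is not defined for categorical attributes; this is the gap that the mode and the simple matching dissimilarity fill.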

Algorithm for K-Mode Clustering:
The steps of the k-mode algorithm are as follows:
INPUT: Number of desired clusters K, data objects D = {d1, d2, …, dn}
OUTPUT: A set of K clusters
1. Generate K clusters arbitrarily by selecting data objects and choosing K initial cluster centres, one for each cluster.
2. Assign each data object to the cluster whose centre is nearest to it according to Equations (1) and (2), where X and Y are two objects described by m categorical attributes:
$$d(X, Y) = \sum_{j=1}^{m} \delta(x_j, y_j) \tag{1}$$
$$\delta(x_j, y_j) = \begin{cases} 0, & x_j = y_j \\ 1, & \text{otherwise} \end{cases} \tag{2}$$
3. Update the K clusters based on the allocation of data objects, and calculate the K new modes of all clusters.
4. Repeat steps 2 and 3 until no data object has changed its cluster membership or some other predefined criterion is fulfilled.
K-Mode is a well-known algorithm that works well for categorical datasets, whereas the K-Means Algorithm does not. It is known for its simplicity and speed, and it scales linearly with the size of the dataset.
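The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names, the optional `initial_modes` argument (added so that a run can be reproduced), the `max_iter` cap and the sample records are all hypothetical, and ties are broken by taking the first candidate.

```python
import random
from collections import Counter

def matching_dissimilarity(x, y):
    """Simple matching dissimilarity: number of attributes on which x and y differ."""
    return sum(a != b for a, b in zip(x, y))

def cluster_mode(members):
    """Frequency-based mode: the most frequent value of each attribute."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))

def k_modes(data, k, initial_modes=None, max_iter=100, seed=0):
    # Step 1: choose K initial centres (random objects unless given explicitly).
    if initial_modes is None:
        modes = random.Random(seed).sample(data, k)
    else:
        modes = list(initial_modes)
    labels = [None] * len(data)
    for _ in range(max_iter):
        # Step 2: assign each object to the cluster with the nearest mode.
        new_labels = [
            min(range(k), key=lambda c: matching_dissimilarity(obj, modes[c]))
            for obj in data
        ]
        # Step 4: stop when no object has changed cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: recompute the mode of every non-empty cluster.
        for c in range(k):
            members = [obj for obj, label in zip(data, labels) if label == c]
            if members:
                modes[c] = cluster_mode(members)
    return labels, modes

# Hypothetical categorical records: (colour, size, shape).
data = [
    ("red", "small", "round"),
    ("red", "small", "oval"),
    ("red", "medium", "round"),
    ("blue", "large", "square"),
    ("blue", "large", "round"),
    ("green", "large", "square"),
]
labels, modes = k_modes(data, k=2, initial_modes=[data[0], data[3]])
print(labels)
print(modes)
```

Each iteration makes one pass over the objects and compares every object with K modes, which is consistent with the linear scalability noted above. With random initial modes the result can differ from run to run, which motivates the initialization methods surveyed in the next section.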

SURVEY ON THE VARIANTS OF K-MODE
The survey of the variants of the k-mode algorithm is divided into three sections.
A. The first section discusses existing ways of selecting initial centroids to improve the accuracy of the clusters in the K-Modes algorithm.
B. The second section discusses algorithms for finding an appropriate dissimilarity measure for datasets containing both numerical and categorical data.
C. The third section discusses ways of removing the dependency on specifying the number of clusters.

A. Selection of Initial Centroids in K-Mode Algorithm
In this section, several ways of improving the accuracy of the clusters by improving the selection of the initial centroids in the K-Mode Algorithm are discussed in Table 2.

Table 2. Selection of initial centroids in k-mode algorithm

S. No. | Algorithm Name | Description | Limitation
1 | K-Mode Algorithm | This algorithm uses modes instead of means, and a frequency-based method is used to update the modes during clustering in order to deal with categorical attributes. | Initial modes are chosen randomly.
2 | K-Prototype Algorithm (1998) [9] | This algorithm integrates the dissimilarity measures of the K-Means and K-Mode algorithms for clustering objects with mixed numeric and categorical attributes. | The relative frequencies of attribute values are not taken into account in the cluster centroids.
3 | [missing] | An initialization method based on Bradley's iterative initial-point refinement algorithm was introduced to K-Modes clustering. | Many parameters have to be set in advance and it takes more time to compute.
4 | COOLCAT Algorithm (2002) [11] | This algorithm is able to deal with clustering of data streams and is based on the notion of entropy. | It depends on the input parameter m, which represents the size of the smallest cluster.
5 | Distance based K-mode (2009) [12] | This algorithm proposed an initialization method for categorical data in which the distance between objects is calculated from the frequency of attribute values. | The subsample is selected randomly and a unique clustering result cannot be guaranteed.
6 | Cluster Center Initialization based K-mode (2013) [13] | This algorithm exploits the observation that some objects with very similar features have the same cluster membership irrespective of the choice of initial cluster centres. | The accuracy of the clusters produced is not better than that of other algorithms.
7 | Entropy based K-Mode Algorithm (2015) [14] | This algorithm improves cluster accuracy, with an analysis of its time complexity, while retaining the scalability of the K-Mode Algorithm. | This algorithm can be improved by some other optimization algorithm while retaining its scalability.
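The notion of entropy that underlies the entropy-based entries in Table 2 can be illustrated with a short sketch. It treats the attributes as independent, so that the entropy of a cluster is the sum of the Shannon entropies of its attributes, and it weights each cluster by its share of the objects. The function names and sample records are assumptions made for illustration; the sketch is not claimed to reproduce COOLCAT or the Entropy based K-Mode Algorithm.

```python
import math
from collections import Counter

def attribute_entropy(values):
    """Shannon entropy (in bits) of the value distribution of one attribute."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def cluster_entropy(members):
    """Entropy of a cluster, assuming independent attributes: sum over attributes."""
    return sum(attribute_entropy(col) for col in zip(*members))

def expected_entropy(clusters):
    """Size-weighted average entropy of a clustering; lower means purer clusters."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * cluster_entropy(c) for c in clusters)

# Hypothetical clusters of (colour, size) records.
pure = [("red", "small"), ("red", "small")]
mixed = [("red", "small"), ("blue", "large")]

print(cluster_entropy(pure), cluster_entropy(mixed))
print(expected_entropy([pure, mixed]))
```

A cluster whose members agree on every attribute has zero entropy, so a clustering with low expected entropy groups similar categorical objects together.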

B. Dissimilarity Metric based K-Mode Algorithm
This section reviews the work carried out on developing dissimilarity measures for datasets containing categorical and mixed data. Some of this work is discussed in Table 3. The dissimilarity is measured in terms of a distance function in order to assess the goodness of a cluster. An appropriate metric must be chosen to achieve the best clustering because it directly influences the shape of the clusters.

Table 3

S. No. | Algorithm Name | Description | Limitation
1 | [missing] | This algorithm proposed a dissimilarity measure based on the similarity between a data object and the cluster mode. | It carries forward the weakness of K-Modes of choosing the initial modes randomly.
2 | K-Mode based upon distance metrics (2007) [16] | This algorithm proposed a dissimilarity measure based on the distance between two values of the same attribute. | It is not suitable for noisy and high-dimensional datasets.
3 | K-Mode based on cost function (2007) [17] | The proposed cost function adds weights for numeric attributes computed from the dataset; all numeric attributes are normalized and discretised for the calculations. | This algorithm can be improved further by improving the discretisation methods for numeric attributes.
4 | Dissimilarity based k-mode (2007) [18] | This algorithm proposed a new dissimilarity measure in which the modes of the clusters are updated in each iteration. | It takes more time to compute.
5 | DVD based K-Mode (2009) [19] | Information about the distribution of the data correlated with each categorical value is used to define the dissimilarity measure. | It takes more computation time and memory.
6 | DILCA Algorithm (2011) [20] | The distance between two values of a categorical attribute is determined by the way in which the values of the other attributes are distributed in the dataset. | The performance depends on an input parameter of the algorithm.
7 | DISC Algorithm (2011) [21] | This algorithm proposed a measure that does not require any domain knowledge to understand the dataset. | It requires feedback from a classifier for more accurate results.
8 | Biological and Genetic taxonomy information based k-mode (2012) [22] | This algorithm proposed a new dissimilarity measure based on the idea of biological and genetic taxonomy and a rough membership function. | It takes more computation time than K-Modes with Huang's measure.

C. K-Mode Algorithm independent of input parameter
This section discusses two algorithms that address the limitation of having to input the value of k, the number of clusters formed, in order to improve the accuracy of the clusters in the k-mode algorithm. These algorithms are discussed along with their advantages and limitations in Table 4.

Table 4

S. No. | Algorithm Name | Description | Limitation
1 | K-Mode based upon cluster centre (2004) | This algorithm uses a regularization parameter to control the number of clusters in the clustering process. | It takes more memory.
2 | K-Mode based upon unified similarity metrics (2013) [24] | This algorithm uses a penalized competitive learning algorithm and requires an initial value of k that is greater than the true value of k. | This algorithm can be further improved in terms of accuracy by a better optimization algorithm.
The K-Mode algorithm based upon cluster centres was proposed by San et al. in 2004; in it, a suitable value of the regularization parameter is chosen to find the most stable clustering results. The K-Mode algorithm based upon unified similarity metrics was proposed by Cheung et al.; its resulting clusters are more accurate than those of the original K-Mode Algorithm. Both algorithms provide better results than the original k-mode algorithm and an accurate number of clusters. In this section, previous work by researchers on k-mode clustering has been reviewed. Clustering categorical data is an important research topic in data mining. The three tables above list different optimized k-mode algorithms aimed at obtaining accurate results.
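As an illustration of a dissimilarity measure for mixed data of the kind surveyed above, the sketch below combines the squared Euclidean distance on numeric attributes with the simple matching dissimilarity on categorical attributes, in the spirit of the K-Prototype Algorithm. The function name, the index arguments and the weight `gamma` that balances the two parts are assumptions made for illustration.

```python
def mixed_dissimilarity(x, y, numeric_idx, categorical_idx, gamma=1.0):
    """Squared Euclidean distance on numeric attributes plus
    gamma times the number of mismatching categorical attributes."""
    numeric_part = sum((x[j] - y[j]) ** 2 for j in numeric_idx)
    categorical_part = sum(x[j] != y[j] for j in categorical_idx)
    return numeric_part + gamma * categorical_part

# Hypothetical records: two numeric attributes followed by two categorical ones.
a = (1.0, 2.0, "red", "small")
b = (2.0, 4.0, "red", "large")

d = mixed_dissimilarity(a, b, numeric_idx=[0, 1], categorical_idx=[2, 3], gamma=0.5)
print(d)
```

The choice of `gamma` matters in practice: a small value lets the numeric attributes dominate, while a large value makes the measure behave like the simple matching dissimilarity. Numeric attributes are usually normalized first so that no single attribute dominates because of its scale.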

COMPARATIVE ANALYSIS
The comparison of the various k-mode algorithms, such as k-prototype, modified k-mode, COOLCAT, DVD based k-mode, DILCA and DISC based k-mode, is discussed in Table 5 based on different output parameters. The output parameters used in Table 5 are as follows:
• True Positive Rate (TP): A true positive test result is one that detects the condition when the condition is present.
• Execution Time: The execution time is defined as the time spent by the system executing the task.
The comparison of the improvements made to the k-mode algorithm is shown in Table 5. This table describes the comparison based on output parameters such as accuracy, precision, recall and execution time, as measured on different datasets by researchers.
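The output parameters named above can be computed as in the following minimal sketch. It assumes that each predicted cluster has already been mapped to a class label and that one class is treated as the positive class; the function names and the sample labels are hypothetical.

```python
import time

def evaluation_metrics(true_labels, predicted_labels, positive=1):
    """Return accuracy, precision and recall (true positive rate) for one positive class."""
    tp = fp = fn = tn = 0
    for t, p in zip(true_labels, predicted_labels):
        if p == positive and t == positive:
            tp += 1      # condition present and detected
        elif p == positive:
            fp += 1      # condition absent but detected
        elif t == positive:
            fn += 1      # condition present but missed
        else:
            tn += 1      # condition absent and not detected
    accuracy = (tp + tn) / len(true_labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

def timed(func, *args):
    """Run func(*args) and return its result with the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start

# Hypothetical ground truth and predictions.
true_labels = [1, 1, 1, 1, 0, 0, 0, 0]
predicted_labels = [1, 1, 1, 0, 0, 0, 0, 1]

(accuracy, precision, recall), elapsed = timed(
    evaluation_metrics, true_labels, predicted_labels
)
print(accuracy, precision, recall, elapsed)
```

Execution times measured this way are only comparable when the algorithms are run on the same machine and the same dataset.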

CONCLUSION
The main objective of clustering is to determine the grouping in a set of unlabelled data on the basis of its features. This review discussed most of the k-mode clustering techniques and their different approaches. From the discussion, it can be seen that there is no absolute best criterion that is independent of the final aim of the clustering. This paper presents an analysis of k-mode clustering algorithms and their limitations, which helps researchers select the one that suits their needs. Future work may eliminate some of the limitations of the existing algorithms. These techniques will be useful for extracting useful information from large datasets using clusters.