A PROPOSAL FOR PREDICTING MISSING VALUES IN A DATASET USING SUPERVISED LEARNING

Missing values occur frequently in field experiments, trials and other collected data. These missing values pose challenges for the data miners and analysts working on a dataset; hence knowing how to predict them is important. The process of replacing a missing value with a predicted value is called imputation. In this paper we propose an imputation method that predicts missing values using a supervised learning classification scheme. The proposed method first maps the missing value problem into a classification problem by discretizing the known available values. We then use the C 4.5 decision tree algorithm to predict the discrete nominal values corresponding to the missing values. Finally, we predict the numeric values for the missing entries using the Local Closest Fit algorithm, where the term "local" is defined by the discretization of the known values of the attribute with missing values. The performance of the proposed method is compared with existing schemes for data imputation, and the results show that the proposed method has higher prediction accuracy.


I. INTRODUCTION
Missing data in a data set refers to an instance wherein no data value is stored for the variable in the observation of interest. Various problems arise from missing data while mining a dataset [1]. Firstly, the absence of data reduces the probability that a test will reject the null hypothesis when it is false, since the test does not have the complete data. Secondly, the lost data causes bias in the estimation of parameters. Missing values also reduce the significance of the samples obtained. Hence, missing values make the analysis of a dataset highly complicated and often lead to invalid conclusions. Due to the frequent occurrence of missing values in training observations, prediction of missing data has always remained at the center of attention of the knowledge discovery in databases and data mining research community [2]. One could also think of discarding the instances with missing values, but this would lead to loss of important information and inaccurate inferences about the data [3]. Hence, prediction of missing data is a better choice than eliminating the instance as a whole. A number of approaches for prediction of missing values have been devised over time. Some of these methods include the concept mean method, k-means clustering [5], unsupervised learning [4], event covering, LEM2 [6], etc.
The objective of this paper is to predict the missing values of an attribute using a supervised classification scheme. Classification is a data mining function that assigns items in a collection to target classes [7]. It not only studies the sample data but also predicts the future behavior of that sample data. The classification process consists of two phases. The first is the learning phase, in which the training data is analyzed and, based on that analysis, a classifier model is built, as shown in Figure 1. In the second phase the test set is evaluated on the developed classifier to predict the class values.
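The two phases can be illustrated with a deliberately minimal classifier sketch (a toy majority-class model of our own, not the method proposed in this paper): `fit` corresponds to the learning phase and `predict` to the evaluation phase.

```python
from collections import Counter

class MajorityClassifier:
    """Toy classifier illustrating the two phases of classification."""

    def fit(self, y_train):
        # Learning phase: analyse the training labels and build a model
        # (here, simply the most frequent class).
        self.label = Counter(y_train).most_common(1)[0][0]
        return self

    def predict(self, X_test):
        # Evaluation phase: assign a class value to each test instance.
        return [self.label for _ in X_test]

model = MajorityClassifier().fit(["setosa", "setosa", "versicolor"])
print(model.predict([[5.1, 3.5], [6.2, 2.9]]))  # both instances get the majority class
```

Any real classifier, including the C 4.5 decision tree used later, follows this same fit-then-predict shape; only the model built in the learning phase differs.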
The rest of the paper is organised as follows: Section 2 presents the background study, where we discuss various discretization approaches and classification techniques. In Section 3, the proposed method for prediction of missing values of an attribute in a dataset is explained. Section 4 shows the results and analysis of the proposed algorithm using the Weka tool. Section 5 concludes the paper along with the future scope of research in the present study.

II. BACKGROUND STUDY
Prediction of missing values of an attribute in a data set using classification involves mapping the missing value problem into a classification problem. This mapping in turn requires discretization of the continuous numerical attribute. The resulting discrete nominal attribute is then used as the target or class attribute in the classification.

A. Normal Distribution
The normal distribution can be specified completely by two parameters, the mean (µ) and the standard deviation (σ). If the mean and the standard deviation are known, then one essentially knows as much as if one had the entire data set. A quick estimate of the spread of data that follows the normal distribution, given the mean and the standard deviation, is provided by the empirical rule [8]. It says that about 68% of the data lies within one standard deviation of the mean, about 95% within two standard deviations, and about 99.7% within three standard deviations.
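As a quick illustration (function name and sample data are our own), the 1-, 2- and 3-sigma intervals of the empirical rule can be computed directly from the sample mean and standard deviation:

```python
import statistics

def empirical_rule_intervals(values):
    """Return the 1-, 2- and 3-sigma intervals around the mean.

    Under the empirical rule these cover roughly 68%, 95% and 99.7%
    of normally distributed data."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [(mu - k * sigma, mu + k * sigma) for k in (1, 2, 3)]

# Fraction of a small sample covered by each interval
data = [4.8, 5.0, 5.1, 5.2, 4.9, 5.3, 5.0, 4.7, 5.1, 5.2]
for lo, hi in empirical_rule_intervals(data):
    inside = sum(lo <= v <= hi for v in data) / len(data)
    print(f"[{lo:.2f}, {hi:.2f}] covers {inside:.0%}")
```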

B. Discretization
Discretization is the process of converting or partitioning continuous attributes into discrete or nominal attributes; it thus transforms quantitative data into qualitative data. The discretization process consists of two steps [9]. First, the number of discrete intervals is chosen, either by some heuristic technique or by running the algorithm multiple times with different numbers of intervals and selecting the best choice by some criterion. Second, the cut points must be determined, which is often done by the discretization algorithm itself. Some of the popular discretization techniques are as follows:
i. Equal Interval Binning: This method divides the entire range into a predetermined number of equal-width intervals. An uneven distribution of data points is a drawback of this method, as some intervals may contain many more data points than others. This can seriously impair the ability of the attribute to support good decision structures.
ii. Equal Frequency Binning: This method tries to overcome the limitation of equal width binning by dividing the domain into intervals containing the same number of data points. It obtains the maximum and minimum values of the attribute, sorts all n values in increasing order, and divides the interval from the minimum to the maximum value into k intervals such that every interval contains the same number (n/k) of the sorted values.
iii. Entropy Based Discretization: Entropy based discretization hinges on two ideas. First, the data should be split into intervals that maximize the information, measured by entropy. Second, the partitioning should not be too fine grained, to avoid overfitting. Out of all possible splitting values, it takes the one that yields the best gain and repeats in a recursive fashion.
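The two binning schemes above can be sketched as follows (a minimal illustration; the function names are ours). Each function returns only the k-1 interior cut points, since the outer boundaries are the minimum and maximum themselves:

```python
def equal_width_cuts(values, k):
    """Equal interval binning: split [min, max] into k equal-width
    intervals and return the k-1 interior cut points."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def equal_frequency_cuts(values, k):
    """Equal frequency binning: sort the n values and place roughly
    n/k of them in each interval; return the k-1 cut points."""
    s = sorted(values)
    n = len(s)
    return [s[(i * n) // k] for i in range(1, k)]

print(equal_width_cuts([0, 2, 4, 10], 2))      # cut at the midpoint of the range
print(equal_frequency_cuts([0, 2, 4, 10], 2))  # cut where half the points lie below
```

Note how the same data yields different cut points: equal width ignores where the points cluster, while equal frequency follows the data density.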

C. Classification
Classification is a data mining technique typically used to extract models describing important data classes. It helps in finding out to which group each data instance belongs within a given dataset. This technique can also be used to predict categorical class labels for a test set, given a training set. The following are prominent existing classification algorithms:
i. K-Nearest Neighbour Algorithm: K-nearest neighbours [10] is a simple algorithm that stores all the available cases and classifies new instances based on a similarity measure such as a distance function. An object is classified by the majority vote of its neighbours, the object being assigned to the class most common amongst its K nearest neighbours, where K is a small positive integer.
ii. ID3 Algorithm: ID3 is an algorithm proposed by Ross Quinlan that generates decision trees, which can then be used for classification problems. The algorithm starts with the original set as the root node [10]. It then chooses the attribute with the highest information gain (i.e. the lowest resulting entropy) to split the set into subsets. The algorithm then recurses on each subset, considering only attributes that were never selected before.
iii. C 4.5 Algorithm: The C 4.5 algorithm is an extension of the ID3 decision tree algorithm [3]. It is a supervised learning algorithm that uses training samples (pairs of input objects and output class values) to build a classifier that correctly classifies the test set (input objects without class values). The classifier built by C 4.5 is a decision tree constructed from root to leaves using the training data, as in ID3. C 4.5 is based on the information gain ratio, a feature selection measure evaluated using entropy [10].
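To make the entropy-based measures concrete, here is a small sketch of entropy and information gain on dictionary-shaped instances (the data and names are illustrative; C 4.5 additionally divides the gain by the split information to obtain the gain ratio):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, target):
    """Reduction in entropy of `target` obtained by splitting on `attr`."""
    before = entropy([r[target] for r in rows])
    n = len(rows)
    after = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        after += (len(subset) / n) * entropy(subset)
    return before - after

rows = [
    {"outlook": "sunny", "play": "no"},
    {"outlook": "sunny", "play": "no"},
    {"outlook": "rain",  "play": "yes"},
    {"outlook": "rain",  "play": "yes"},
]
print(information_gain(rows, "outlook", "play"))  # 1.0: the split is perfect
```

ID3 picks the attribute maximizing this gain at every node; C 4.5's gain ratio corrects the bias of plain gain towards attributes with many distinct values.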

III. PROPOSED METHODOLOGY
Predicting missing values is generally considered to be a part of the data cleansing phase done before data mining or any further analysis. Our proposed method for prediction of missing values is restricted to a single attribute with numerical values. This method first maps the missing value problem into a classification problem using a proposed discretization algorithm based on normal distribution. Then the values are predicted using classification algorithm. The proposed method consists of three modules namely Discretization, Classification and Prediction.

A. Discretization:
The proposed normal distribution based discretization method consists of the following steps:
Step 1: Take all the available instances of the attribute with the missing values.
Step 2: Find the maximum and minimum values for it.
Step 3: Compute the mean (µ) and standard deviation (σ) of the available values.
Step 4: Partition the range into k classes based on (µ) and (σ).
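The paper does not spell out the cut points used in Step 4. One plausible reading, sketched below under that assumption (all names are ours), places the k-1 cut points symmetrically around the mean at steps of one standard deviation:

```python
import statistics

def normal_discretize(values, k):
    """Hypothetical normal-distribution-based discretization: place k-1
    cut points symmetrically around the mean (µ), one standard deviation
    (σ) apart, and label each value by the interval it falls into."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    # e.g. for k = 3 the cuts are µ - 0.5σ and µ + 0.5σ
    cuts = [mu + (i - (k - 2) / 2) * sigma for i in range(k - 1)]
    # each value's bin index = number of cut points below it
    labels = [f"bin{sum(v > c for c in cuts)}" for v in values]
    return labels, cuts

labels, cuts = normal_discretize([1.0, 2.0, 3.0, 4.0, 5.0], 3)
print(labels)  # ['bin0', 'bin0', 'bin1', 'bin2', 'bin2']
```

By the empirical rule of Section II-A, the central intervals then capture most of the mass of a normally distributed attribute, so the classes are reasonably populated.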

B. Classification:
The classification in our proposed method is done using the C 4.5 classification method, which generates classifiers expressed as decision trees [11]. It is one of the best decision tree algorithms: its output can be easily interpreted and it can deal with noise.

C. Prediction:
We use the Local Closest Fit (LCF) approach [12] to predict a numerical value from the interval output by the trained classifier. The LCF algorithm works as follows. Suppose a dataset D_old having missing values in attribute a_i is separated into two datasets F and M, where A is the (predicted) class label. The pseudo code for the LCF algorithm is:

For each instance X in M with predicted class A:
    Compute distance(X, Y) between X and every instance Y in F such that Y.class = A
    MinInstance ← the instance Y with the minimum value of distance(X, Y)
    X.a_i ← MinInstance.a_i
End For

Here distance(X, Y) is calculated as the sum over the attributes of |x_i - y_i| / r, where r is the difference between the maximum and minimum values of that attribute within the class. Figure 2 represents the proposed prediction algorithm as a flowchart. Briefly, the proposed algorithm consists of the following steps:
Step 1: Take the supplied data set D_old with missing values in the attribute a_i.
Step 2: Split D old into two datasets F (containing all filled instances) and M (containing all instances with missing attribute values).
Step 3: Discretize the attribute a i in F using normal distribution based discretization.
Step 4: Build a C 4.5 classifier by training the dataset F with nominal values of a i as the target class.
Step 5: Test the dataset M on the above classifier to predict the nominal values corresponding to the missing values.
Step 6: Use the Local Closest Fit algorithm to predict the numeric value corresponding to the nominal value of a_i.
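The Local Closest Fit step (Step 6) can be sketched as follows. This is our own illustration of the idea, with hypothetical names: for each instance in M it searches only among the F instances sharing the predicted class label ("local"), and copies a_i from the nearest one under the range-normalized distance defined above.

```python
def local_closest_fit(F, M, attr, class_attr, other_attrs):
    """Fill `attr` in each instance of M from its closest same-class
    neighbour in F, using a range-normalized Manhattan distance."""
    for X in M:
        # "Local": restrict the search to F instances whose class label
        # matches the label predicted for X by the classifier.
        candidates = [Y for Y in F if Y[class_attr] == X[class_attr]]

        def distance(Y):
            total = 0.0
            for a in other_attrs:
                vals = [Z[a] for Z in candidates]
                r = (max(vals) - min(vals)) or 1.0  # guard against r = 0
                total += abs(X[a] - Y[a]) / r       # |x_i - y_i| / r
            return total

        nearest = min(candidates, key=distance)
        X[attr] = nearest[attr]  # copy the closest fit's value
    return M

F = [{"sepal": 1.0, "petal": 2.0, "cls": "low"},
     {"sepal": 9.0, "petal": 8.0, "cls": "low"}]
M = [{"sepal": 1.5, "petal": None, "cls": "low"}]
local_closest_fit(F, M, "petal", "cls", ["sepal"])
print(M[0]["petal"])  # 2.0: the nearer low-class instance supplies the value
```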

IV. RESULTS AND ANALYSIS
The proposed approach for predicting missing values has been tested with two different datasets: the Iris dataset (5 attributes and 150 instances) and the Shuttle dataset (9 attributes and 14500 instances). Before performing the analysis, we first manually replace some of the values of a single attribute in these datasets with "?". The instances containing "?" are then separated from those without it, resulting in a training data set F (with all filled instances) and a test data set M (containing all missing value instances).

A. Tool used: Weka
Weka stands for Waikato Environment for Knowledge Analysis which is a collection of many state of the art machine learning algorithms and data pre-processing tools [13,14]. It was developed at the University of Waikato in New Zealand. It provides extensive support for the whole process of experimental data mining, evaluating learning schemes statistically and visualizing results of learning algorithms.

B. Results
First, we apply discretization to the entire range of a given attribute with K (K = 5, 7 and 9) intervals (also called binning). Table 2 shows the comparison of the actual value and the predicted value of an attribute using the existing concept mean (CM) method, the most common value (MCV) method and the proposed prediction method with 5, 7 and 9 interval binning for the Iris dataset. Table 3 shows the comparison of the average error in the prediction of missing values for the CM method, the MCV method and the proposed prediction method with 5, 7 and 9 interval binning for the Iris dataset. Table 4 shows the comparison of the prediction accuracy percentage for the same methods on the Iris dataset. From Tables 3 and 4, it is clear that the proposed method provides better results in terms of average prediction error and percentage prediction accuracy than the existing state-of-the-art methods for the Iris dataset. We also observe that increasing the number of discretization intervals does not improve the prediction result, while it increases the computation time. Furthermore, comparing the prediction accuracy of each attribute for different numbers of discretization intervals, we find that the proposed method with 5 bins gives the best results. Figure 3 presents the obtained results for the Iris dataset in graphical form.
Fig. 3 Comparison of Prediction Accuracy for Iris Dataset
Table 5 shows the comparison of the actual value and the predicted value of an attribute using the CM method, the MCV method and the proposed prediction method with 5, 7 and 9 interval binning for the Shuttle dataset.
Table 6 shows the comparison of the average error in the prediction of missing values for the same methods on the Shuttle dataset. Table 7 shows the comparison of the prediction accuracy percentage for the same methods on the Shuttle dataset. From Tables 6 and 7, it is clear that the proposed method provides better results in terms of average prediction error and percentage prediction accuracy than the existing state-of-the-art methods for the Shuttle dataset. We also observe that the proposed method with 5 bins gives better prediction results than with 7 or 9 bins. Figure 4 compares the prediction accuracy of the CM method, the MCV method and the proposed method with 5, 7 and 9 interval binning for the Shuttle dataset in graphical form.

V. CONCLUSION AND FUTURE WORK
In this work, we proposed a method for predicting missing values in a dataset based on a supervised classification scheme. The proposed method first maps the missing value problem into a classification problem by performing normal distribution based discretization of the known values of the attribute with missing values. It then predicts, via classification, the nominal value corresponding to each missing value. Finally, known prediction approaches are employed on the new data set to predict the numeric values.
The analysis on the Shuttle and Iris datasets shows that the proposed method with the Local Closest Fit approach provides the best results both in terms of average prediction error and average accuracy. Since the proposed method works well when the attributes of the dataset follow a normal distribution, there is scope for adopting another suitable discretization approach when the attributes are not normally distributed. We also dealt only with numeric attributes; hence there is scope for handling categorical attributes in future work.