PREDICTION AND FEATURE REDUCTION USING NON-PARAMETRIC DATA MINING TECHNIQUES

ABSTRACT
Dimensionality reduction is a technique that endeavors to convert data from a high-dimensional space to a lower-dimensional space while preserving the relationships among the measurements, and it can further improve accuracy. Data mining has great potential in the healthcare field. In this paper, data mining classification techniques such as k-Nearest Neighbor, Support Vector Machine, and Random Forest, together with Principal Component Analysis, have been implemented. The paper deals with attribute selection for dimensionality reduction in machine learning. The experimental results are tabulated, and graphs indicate the performance of each technique used. The Support Vector Machine provides the best results, with the highest accuracy and the lowest error rate, when compared with the other classifiers.


INTRODUCTION
Dimensionality reduction is a technique for reducing the number of measurements describing an object. Data mining has attracted great attention from various fields due to the wide and large amounts of data present in these fields. The information and knowledge gained by data mining can be applied in various areas, including market analysis, business and e-commerce, fraud detection, customer retention, production control, science, engineering, and healthcare [1]. This paper discusses different data mining classification techniques: k-Nearest Neighbour, Support Vector Machine, Random Forest, and Principal Component Analysis (PCA). The k-nearest neighbor algorithm is a non-parametric technique used for classification based on the closest training examples in the feature space. The goal of a Support Vector Machine is to find the optimal separating hyperplane, i.e., one that correctly classifies the training data with the largest margin. Random Forest is an ensemble classifier that consists of many decision trees and outputs the class that is the mode of the classes output by the individual trees. PCA seeks to reduce the dimension of the data by finding a few orthogonal linear combinations (the principal components, PCs) of the original variables with the largest variance.

LITERATURE REVIEW
S. Neelamegam, Dr. E. Ramaraj, et al. [4] presented "Classification algorithm in Data mining: An Overview", a survey of different data mining classification techniques including decision tree, K-Nearest Neighbor, and Support Vector Machine. Gopala Krishna Murthy Nookala, et al. [5] examined the performance analysis and evaluation of different data mining algorithms used for cancer classification; from the acquired results, it is shown that the performance of a classifier depends on the data set. Ashfaq Ahmed K, et al. [6] presented the prediction performance of Support Vector Machine and Random Forest, where different training models are created using different kernel functions such as linear, polynomial, and radial basis functions. Venkatadri M and Lokanatha C. Reddy [7] presented a comparative performance study of decision tree techniques, from which it is observed that the classification accuracy varies.
Moloud Abdar, et al. [8] presented a comparison of data mining algorithms in the prediction of healthcare outcomes. Five algorithms, namely decision tree, neural network, support vector machine, k-nearest neighbour, and logistic regression, are used for classification and comparison. Jolliffe I, et al. [9] describe Principal Component Analysis (PCA), a mathematical algorithm that reduces the dimensionality of the data while retaining most of the variation in the dataset. It accomplishes this reduction by identifying directions, called principal components, along which the variation in the data is maximal.

PROPOSED METHODOLOGY
The proposed method classifies the level of locations based on the tobacco-use risk factor and compares the performance of K-Nearest Neighbour, Random Forest, Support Vector Machine, and Principal Component Analysis on this data. The objectives of this paper are to:
• Choose the dataset to work with.
• Prepare the data.
• Apply the data to the classification algorithms.
• Compare the performance of the algorithms.
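The four steps above can be sketched end to end. The dataset, the two toy classifiers (a majority-class baseline and a nearest-centroid rule), and all names below are hypothetical illustrations, not the paper's actual models:

```python
# Minimal sketch of the workflow: choose data, prepare it, apply
# classifiers, and compare their accuracy. Toy data and classifiers only.

def train_test_split(X, y, test_ratio=0.25):
    """Step 2: prepare the data by holding out a test portion."""
    cut = int(len(X) * (1 - test_ratio))
    return X[:cut], y[:cut], X[cut:], y[cut:]

def majority_classifier(train_y):
    """Baseline: always predict the most frequent training label."""
    label = max(set(train_y), key=train_y.count)
    return lambda x: label

def nearest_centroid_classifier(train_X, train_y):
    """Predict the label of the closest class centroid."""
    centroids = {}
    for label in set(train_y):
        rows = [x for x, t in zip(train_X, train_y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    def predict(x):
        return min(centroids,
                   key=lambda c: sum((a - b) ** 2
                                     for a, b in zip(x, centroids[c])))
    return predict

def accuracy(predict, X, y):
    """Step 4: fraction of correctly classified instances."""
    return sum(predict(x) == t for x, t in zip(X, y)) / len(y)

# Step 1: a toy two-feature dataset standing in for the tobacco-use data.
X = [[1.0, 1.2], [0.9, 1.0], [1.1, 0.8], [5.0, 5.2],
     [4.8, 5.1], [5.2, 4.9], [1.0, 0.9], [5.1, 5.0]]
y = ["low", "low", "low", "high", "high", "high", "low", "high"]

tr_X, tr_y, te_X, te_y = train_test_split(X, y)
for name, clf in [("majority", majority_classifier(tr_y)),
                  ("centroid", nearest_centroid_classifier(tr_X, tr_y))]:
    print(name, accuracy(clf, te_X, te_y))
```

On this toy split the centroid rule separates the two clusters perfectly, while the baseline gets only one of the two test points right.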

Dataset Description
The dataset used for this study is taken from the web and is behavioral risk factor data of

Feature Selection or Attribute Selection
Attribute subset selection reduces the dataset size by removing irrelevant or redundant attributes. In machine learning, feature selection is also known as attribute selection or variable selection. Reasons for using feature selection techniques:
• simplification of models to make them easier to interpret by researchers/users,
• shorter training times,
• avoiding the curse of dimensionality,
• enhanced generalization by reducing overfitting.
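One simple attribute-selection filter, sketched below under the assumption of numeric attributes, drops near-constant columns: an attribute whose variance falls below a threshold carries little discriminative information. The records and the threshold value are hypothetical:

```python
# Minimal variance-threshold attribute selection on hypothetical records.

def variance(column):
    mean = sum(column) / len(column)
    return sum((v - mean) ** 2 for v in column) / len(column)

def select_attributes(rows, threshold=1e-3):
    """Return indices of attributes whose variance exceeds the threshold."""
    columns = list(zip(*rows))
    return [i for i, col in enumerate(columns) if variance(col) > threshold]

rows = [
    [1.0, 0.5, 3.1],
    [1.0, 0.6, 2.9],   # attribute 0 is constant -> redundant
    [1.0, 0.4, 3.0],
]
kept = select_attributes(rows)
reduced = [[r[i] for i in kept] for r in rows]   # dataset after reduction
print(kept)  # [1, 2]
```

Constant attribute 0 is removed, shrinking the dataset while keeping the informative columns.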

Cross Validation
Cross validation is a technique used to assess and evaluate the performance of machine learning algorithms. The technique evaluates a model on data it has not been trained on. In every round of cross validation, the given original data set is randomly partitioned into a training set, used to train the machine learning algorithm, and a testing set, used to evaluate its performance.
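The rotating partition described above can be sketched as a k-fold index generator; the data is a hypothetical placeholder:

```python
# Minimal k-fold cross-validation sketch: shuffle once, then rotate
# which fold serves as the test set in each round.
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k rounds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)          # random partition, done once
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

for train, test in k_fold_indices(n=10, k=5):
    assert set(train).isdisjoint(test)                 # no leakage
    assert sorted(train + test) == list(range(10))     # every record used
    print(len(train), len(test))  # 8 2 in each round
```

Each round trains on k−1 folds and tests on the held-out fold; the per-round scores are then averaged.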

Pre-processing the Data
The original dataset retrieved from the web has noisy and missing values, which may affect the quality of the results. In order to improve the quality of the data and of the mining results, the raw tobacco data is pre-processed so as to improve the efficiency of the mining process. The proposed method uses pre-processing methods such as data reduction and replacing missing values.
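One common way to replace missing values, sketched below on hypothetical records (with `None` marking a missing entry), is mean imputation per attribute:

```python
# Minimal pre-processing sketch: replace missing values (None) with the
# mean of the observed values in the same attribute column.

def impute_mean(rows):
    """Return a copy of rows with None replaced by the column mean."""
    columns = list(zip(*rows))
    means = []
    for col in columns:
        observed = [v for v in col if v is not None]
        means.append(sum(observed) / len(observed))
    return [[v if v is not None else means[j]
             for j, v in enumerate(row)] for row in rows]

rows = [[1.0, None], [3.0, 4.0], [None, 8.0]]
print(impute_mean(rows))  # [[1.0, 6.0], [3.0, 4.0], [2.0, 8.0]]
```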

CLASSIFICATION TECHNIQUE
Data mining offers different types of classification techniques, as follows.

K-Nearest Neighbor
K-Nearest Neighbor is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., a distance function) [4] [5]. The simple version of the KNN classifier algorithm predicts the target label by finding the nearest neighbor classes. The closest class is identified using distance measures such as Euclidean distance.
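A minimal KNN sketch with Euclidean distance and majority voting, on a small hypothetical training set (not the tobacco data):

```python
# k-nearest neighbors: store all cases, then classify a query by the
# majority label among its k closest training points.
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, query, k=3):
    """train is a list of (features, label) pairs; returns the majority
    label among the k training points closest to query."""
    neighbors = sorted(train, key=lambda fl: euclidean(fl[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [([1.0, 1.0], "low"), ([1.2, 0.8], "low"), ([0.8, 1.1], "low"),
         ([5.0, 5.0], "high"), ([5.2, 4.8], "high"), ([4.9, 5.1], "high")]
print(knn_predict(train, [1.1, 0.9]))  # low
print(knn_predict(train, [5.1, 5.0]))  # high
```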

Support Vector Machine
A Support Vector Machine (SVM) is a supervised classifier formally defined by a separating hyperplane. The goal of an SVM is to find the hyperplane that maximizes the margin of the training data. SVM is a classification algorithm, which means we use it to predict whether an individual belongs to a particular class. For instance, consider the training data below [6] [8]: we have plotted the size and weight of several people, and there is a way to distinguish between male and female. Just by looking at the plot, we could trace a line such that all the data points representing males lie above the line and all the data points representing females lie below it, as shown in figure 3.
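The maximum-margin idea can be sketched with a linear SVM trained by stochastic sub-gradient descent on the regularized hinge loss. The toy "size vs. weight" points (centered, with labels +1 above the line and −1 below) are hypothetical, and this is an illustration of the principle, not the paper's implementation:

```python
# Minimal linear SVM: minimize hinge loss + L2 regularizer by SGD.

def train_linear_svm(X, y, eta=0.1, lam=0.01, epochs=100):
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # the regularizer shrinks w at every step (drives a wide margin)
            w = [(1 - eta * lam) * wj for wj in w]
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:   # wrong side, or inside the margin: hinge update
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
                b += eta * yi
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# toy, centered (size, weight) features; +1 = one class, -1 = the other
X = [[2, 3], [3, 3], [3, 4], [-2, -3], [-3, -2], [-2, -2]]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
print([predict(w, b, x) for x in X])  # [1, 1, 1, -1, -1, -1]
```

Because the toy data is linearly separable, the learned line classifies every training point correctly with a margin; non-linear kernels (polynomial, radial) extend the same idea to curved boundaries.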

Principal Component Analysis
Principal Component Analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components (or, sometimes, principal modes of variation) [9]. The aim is to perform dimensionality reduction while preserving as much of the variation in the high-dimensional space as possible. Principal Component Analysis is performed on the covariance matrix or on the correlation matrix; these matrices can be calculated from the data matrix, as shown in figure 6.
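The covariance-matrix route can be sketched in pure Python: center the data, build the covariance matrix, and extract the first principal component by power iteration. The 2-D points are hypothetical; real use would employ a linear-algebra library:

```python
# Minimal PCA sketch: covariance matrix of centered data, dominant
# eigenvector via power iteration.
import math

def first_principal_component(rows, iters=200):
    n, d = len(rows), len(rows[0])
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[v - m for v, m in zip(row, means)] for row in rows]
    # covariance matrix C[i][j] = sum_k x_ki * x_kj / (n - 1)
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1)
            for j in range(d)] for i in range(d)]
    v = [1.0] * d                       # power iteration on C
    for _ in range(iters):
        v = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v, centered

# points scattered along the line y = x, so the first PC is ~(0.71, 0.71)
rows = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.9], [5.0, 5.1]]
pc, centered = first_principal_component(rows)
scores = [sum(p * x for p, x in zip(pc, row)) for row in centered]
print(pc)
```

Projecting the centered rows onto `pc` (the `scores` list) reduces the two attributes to a single component while retaining most of the variation.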

PERFORMANCE MEASURES
After the classification process, the performance of the algorithms used is compared based on performance measures such as correctly and incorrectly classified instances, kappa statistic, mean absolute error, root mean squared error, relative absolute error, root relative squared error, true positive rate, and false positive rate. For ease of comparison, the acquired results are interpreted as graphs. The F-measure combines precision and recall: F1 = (2 × Precision × Recall) / (Precision + Recall).
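Several of these measures derive directly from the four confusion-matrix counts; a sketch with hypothetical counts (not the paper's results):

```python
# Performance measures computed from binary confusion-matrix counts.

def measures(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # also called sensitivity / TP rate
    fp_rate = fp / (fp + tn)         # false positive rate
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "fp_rate": fp_rate, "f1": f1}

m = measures(tp=40, fp=10, fn=10, tn=40)
print(m["f1"])  # 0.8
```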

RESULTS AND DISCUSSION
The results show the accuracy and kappa statistic derived from the confusion matrix. The tested results are tabulated, and graphs indicate the performance of each technique used.

Accuracy Measures
Accuracy is calculated using the confusion matrix: Accuracy = (TP + TN) / (TP + FN + FP + TN). Accuracy is measured for each algorithm. The confusion matrix is a useful tool for analyzing how well a classifier can recognize tuples of different classes. We can also speak of the error rate or misclassification rate of a classifier M, which is simply (1 − Accuracy).
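For a multi-class confusion matrix the same formula becomes the trace over the total count; the 3×3 matrix below is hypothetical, not one of the paper's tables:

```python
# Accuracy from a confusion matrix: correct predictions sit on the
# diagonal, so accuracy = trace / total, and error rate = 1 - accuracy.

def accuracy_from_confusion(matrix):
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

confusion = [
    [50,  3,  2],   # rows: actual class, columns: predicted class
    [ 4, 40,  1],
    [ 6,  4, 30],
]
acc = accuracy_from_confusion(confusion)
print(acc, 1 - acc)   # accuracy and misclassification rate
```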
Various factors affect the level of accuracy: the dimensionality, the quality of the data, the record size, and many other things.

Accuracy Measures of Classification Techniques with Confusion Matrix

Results for k-Nearest Neighbor
K-Nearest Neighbor obtained an accuracy of 11.1% (correctly classified instances), with 88.9% of instances incorrectly classified, on the "Tobacco use" data. The following table 3 shows the confusion matrix of k-Nearest Neighbor.

Results for Support Vector Machine
Support Vector Machine obtained an accuracy of 90% (correctly classified instances), with 10% of instances incorrectly classified, on the "Tobacco use" data. The following table 4 shows the confusion matrix of the Support Vector Machine.

Results for Principal Component Analysis
Principal Component Analysis obtained an accuracy of 12.1% (correctly classified instances), with 87.9% of instances incorrectly classified, on the "Tobacco use" data. The following table 6 shows the confusion matrix of Principal Component Analysis.