IMPROVING THE PERFORMANCE OF A CLASSIFICATION BASED OUTLIER DETECTION SYSTEM USING DIMENSIONALITY REDUCTION TECHNIQUES

Abstract: The basic concept of classification based outlier detection is to train a model that separates outliers from normal data. A medical cancer dataset is used as the application domain. In a comparison with the C4.5 and Decision Tree classification algorithms, the K-Nearest Neighbor algorithm proves more suitable for identifying outliers in terms of f-score, error rate and accuracy. The time taken to identify outliers using KNN is also less than that of C4.5 and Decision Tree. In this work, the classification performance for outlier identification is measured after applying dimensionality reduction algorithms such as PCA, KPCA and LPP, and the results reveal that dimensionality reduction enhances the classification performance on the cancer dataset to a significant level.


INTRODUCTION
Outliers may be significant items which represent the general characteristics of an object. This work aims to study the performance of classification algorithms for outlier detection when combined with dimensionality reduction. Before eliminating items, one should study the relevance of each item in the dataset. In a high dimensional data set, some attributes may be irrelevant, and feature selection approaches such as filter and wrapper methods can be used to find a subset of the original attributes.

Problem Specification
The identification of outliers can be viewed as a classification problem, one that can lead to the discovery of unexpected knowledge in the medical field. The general idea is to train a classification model that can distinguish normal data from outliers [1].
In the medical cancer dataset, the number of available malignant/outlier samples is smaller than the number of normal/benign samples, and this imbalance leads to an inaccurate classifier model. Many solutions, such as factor analysis and principal component methods, have been suggested to improve the efficiency of the algorithms by eliminating variables. This work proposes to use dimensionality reduction and feature selection algorithms to overcome the training performance and testing accuracy issues in classification based outlier detection approaches.

MODELING CLASSIFICATION BASED OUTLIER DETECTION SYSTEM
The popular methods of outlier detection are supervised, semi-supervised, unsupervised and proximity-based methods. Grubbs' test identifies one outlier at a time in univariate data, and Rosner's test is a sequential procedure for detecting a maximum of ten outliers. So there is a need for a more sophisticated and speedy method, known as classification based outlier detection, which depends heavily on the quality and availability of the training data set.

A. Algorithm for dimensionality reduction
The number of variables used to describe an object is known as the dimensionality of that object. Dimensionality reduction is the search for a smaller set of features that adequately describes the original data.

(a). Principal Component Analysis
Principal Component Analysis compresses an N-dimensional vector to an M-dimensional vector, where M < N, by leaving out the components that contribute least to the information stored in the data set.
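As an illustration, compressing the N = 30 features of the Wisconsin breast cancer data to M = 5 principal components can be sketched with scikit-learn. The library, the standardization step and the choice of 5 components are assumptions made here for illustration, not part of the paper's Matlab/Weka setup:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)   # N = 30 original features
X_std = StandardScaler().fit_transform(X)    # PCA is scale-sensitive

pca = PCA(n_components=5)                    # M = 5 < N = 30
X_reduced = pca.fit_transform(X_std)

print(X_reduced.shape)                       # (569, 5)
print(pca.explained_variance_ratio_.sum())   # fraction of variance retained
```

The `explained_variance_ratio_` sum indicates how much of the original information the 5 retained components carry.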

(b). Kernel PCA
Kernel PCA is a technique for extracting non-linear mappings that maximize the variance in the data.
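A minimal sketch of kernel PCA, again assuming scikit-learn; the RBF kernel and its gamma value are illustrative guesses, not tuned settings. A toy two-circles data set is used because it is linearly inseparable, which is exactly the case where a non-linear mapping helps:

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Two concentric circles: linearly inseparable, so plain PCA cannot separate them.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0)
X_kpca = kpca.fit_transform(X)
print(X_kpca.shape)  # (200, 2)
```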

(c).LPP (Locality Preserving Projection)
LPP is a linear technique that preserves the local neighborhood structure of the data. It computes the optimal linear approximations to the eigenfunctions of the Laplace-Beltrami operator on the manifold, and can therefore be seen as a linear alternative to nonlinear manifold learning methods.
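LPP is not part of scikit-learn, so the following is a from-scratch sketch of the usual formulation: build a k-nearest-neighbour graph with heat-kernel weights W, form the graph Laplacian L = D - W, and solve the generalised eigenproblem X^T L X a = lambda X^T D X a, keeping the eigenvectors with the smallest eigenvalues. The neighbourhood size k, the heat-kernel width heuristic and the small regularisation term are all illustrative choices:

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def lpp(X, n_components=2, k=5, t=None):
    """Locality Preserving Projection (illustrative sketch)."""
    n = X.shape[0]
    D2 = cdist(X, X, "sqeuclidean")           # pairwise squared distances
    if t is None:
        t = np.mean(D2[D2 > 0])               # heat-kernel width heuristic
    W = np.zeros((n, n))
    idx = np.argsort(D2, axis=1)[:, 1:k + 1]  # k nearest neighbours (skip self)
    for i in range(n):
        for j in idx[i]:
            w = np.exp(-D2[i, j] / t)         # heat-kernel weight
            W[i, j] = W[j, i] = w             # keep the graph symmetric
    D = np.diag(W.sum(axis=1))                # degree matrix
    L = D - W                                 # graph Laplacian
    A = X.T @ L @ X
    B = X.T @ D @ X + 1e-9 * np.eye(X.shape[1])  # regularised for stability
    # smallest generalised eigenvalues give the projection directions
    _, vecs = eigh(A, B)
    return X @ vecs[:, :n_components]

Z = lpp(np.random.RandomState(0).randn(60, 8), n_components=2)
print(Z.shape)  # (60, 2)
```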

B. The Model of Dimensionality Reduction
The framework of the classification based outlier detection system that we are going to develop and evaluate in this work is shown in figure 1.

b) DT Classifier
It is a predictive modeling tool that identifies the most important attributes by a hierarchical breakdown of the data.

c) K-NN Classifier
K-Nearest Neighbors assigns an input instance to the class held by the majority of its K nearest neighbors, where nearness is measured by the Euclidean distance between two instances.
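This majority-vote rule can be sketched on the WBCD data with scikit-learn (assumed here for illustration; the paper used Weka's implementation). The value k = 5 and the hold-out split are illustrative defaults rather than the paper's settings:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# majority vote over the 5 nearest neighbours, Euclidean distance
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_tr, y_tr)
print(knn.score(X_te, y_te))  # hold-out accuracy
```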

III. THE ASSESSMENT
The efficiency of the classification algorithms under evaluation was tested with the "Wisconsin Breast Cancer Database" (WBCD). The dataset is summarized in Table 1.

Metrics Used For Assessment
Rand index and run time are the two measures used for assessing the algorithms under consideration. The total run time is the time taken for training plus testing, but this work focuses on the training time, which is greater than the testing time.

Assessment of Performance
a) Confusion Matrix
A confusion matrix reveals the types of classification error a classifier produced. The advantage of using this matrix is that it tells us not only how many objects were misclassified but also which misclassifications occurred. If the test data contains a total of T objects and C of them are correctly classified, then Error Rate = (T - C) / T.
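The error-rate formula can be checked against a confusion matrix on toy labels (scikit-learn's confusion_matrix is assumed; the labels here are made up purely for illustration):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])  # toy ground-truth labels
y_pred = np.array([0, 1, 0, 1, 1, 0, 1, 1])  # toy predictions

cm = confusion_matrix(y_true, y_pred)
T = len(y_true)      # total objects
C = np.trace(cm)     # correctly classified objects lie on the diagonal
error_rate = (T - C) / T
print(cm)
print(error_rate)    # (8 - 6) / 8 = 0.25
```

The off-diagonal cells show which misclassifications occurred, not just how many.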

b) Validation Methods
The validation method used in this work is K-fold cross validation. The data set is partitioned into K disjoint subsets of almost equal size. In each round, one subset is treated as the test set and the classifier is built with the remaining K-1 subsets merged into a training set; the accuracy is then estimated on the test set. The procedure is repeated K times so that each subset serves as the test set exactly once, and thus each data point appears in a test set exactly once.
In the first iteration, the subsets c2, …, ck jointly serve as the training set while c1 is treated as the test set for the first model. The second model is trained with subsets c1, c3, …, ck and tested on c2, and so on [20].
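The procedure above maps directly onto scikit-learn's cross-validation helpers (assumed here for illustration; the original work used Weka's implementation through Matlab):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
cv = KFold(n_splits=10, shuffle=True, random_state=0)  # K = 10 disjoint subsets
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=cv)
print(len(scores), scores.mean())  # 10 per-fold accuracies and their average
```

Each of the 10 scores comes from a model tested on the one fold it was not trained on.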

About the Implementation
The proposed outlier detection software is built with Matlab version 7.4.0 (R2007a) and uses some features of Weka through Matlab interface code. The Mex and Java interfaces of Matlab are used to implement this outlier detection software. The standard Weka implementations of the classification algorithms are used, and the default parameters are passed when invoking the classifier algorithms [20].

IV. THE RESULTS AND DISCUSSION
The second plot clearly shows that the benign records are grouped together and form a distinct cluster. The red points deviating from the black cluster are the outliers, which signify the malignant nature of those cases [16][17][20].
Each table cell value is the average of 100 separate runs with different training and testing data sets: each value is the average of 10 trials, and each trial used 10-fold cross validation.

The Effect of Dimensionality Reduction Algorithms
This experiment reveals the outlier detection performance with different dimensionality reduction algorithms as well as different feature sets. It is clear that reducing the data to 5 dimensions is sufficient to represent the whole data and hence produced good results, giving a significant improvement in performance.
The sensitivity, or recall, measures the proportion of actual malignant records that are correctly identified as outliers; as shown in the graph, the proposed PCA+C4.5 and proposed PCA+Decision Table classifiers performed well with respect to this measure. The accuracy measures the capability of the algorithms to correctly identify the normal records as well as the outliers in the data; again, the proposed PCA+C4.5 and proposed PCA+Decision Table classifiers performed better than the others. The specificity measures the proportion of normal records that are correctly identified, and its graph is shown below.
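These metrics can be sketched end to end with a PCA-plus-classifier pipeline on the WBCD data. Everything here is an illustrative assumption: scikit-learn stands in for Matlab/Weka, and `DecisionTreeClassifier` stands in for the paper's C4.5 and Decision Table learners, neither of which scikit-learn provides. The malignant class (label 0 in this dataset) is treated as the outlier class:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)   # 0 = malignant, 1 = benign
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = make_pipeline(StandardScaler(), PCA(n_components=5),
                    DecisionTreeClassifier(random_state=0))
clf.fit(X_tr, y_tr)

cm = confusion_matrix(y_te, clf.predict(X_te), labels=[0, 1])
# cm[0, 0]: malignant correctly flagged; cm[1, 1]: benign correctly kept
sensitivity = cm[0, 0] / cm[0].sum()   # malignant (outlier) recall
specificity = cm[1, 1] / cm[1].sum()   # proportion of normal records identified
accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()
print(sensitivity, specificity, accuracy)
```

Note that because malignant is label 0 here, its recall is read from the first row of the confusion matrix rather than the usual positive-class position.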

Performance in Terms of Specificity
The following bar chart shows the performance of the algorithms in terms of precision.

V. CONCLUSION
The performance of outlier detection using dimensionality reduction algorithms is evaluated with Matlab and the outlier detection software. The results illustrate that the influence of these algorithms on the cancer dataset is considerably high and increases the overall classification performance.
The excellent outlier detection performance of the proposed PCA+C4.5 and proposed PCA+Decision Table classifier algorithms reveals that a classification algorithm can correctly identify multidimensional outlier data in its subspace.
Future work may address the possibility of improving the performance of the classification algorithms by using a better distance metric.