ENHANCING FILTER BASED ALGORITHMS FOR SELECTING OPTIMAL FEATURES FROM THYROID DISEASE DATASET

Abstract: In medical science, automatic disease diagnosis is an invaluable tool because the specialist's observation is restricted and medical knowledge is uncertain. Advances in medical information technology have enabled healthcare industries to automatically collect huge amounts of data through clinical laboratory examinations. To explore these data, the past few years have seen the use of Computer Aided Diagnosis (CAD) systems in many screening sites and hospitals. In CAD, thyroid function diagnosis is treated as a classification problem in which the system automatically identifies the thyroid condition (hyper, hypo or normal). Machine learning techniques are increasingly used to construct CAD systems owing to their strong capability of extracting complex relationships in biomedical data.


INTRODUCTION
Data mining plays a vital role in the medical field for disease diagnosis. It offers many classification techniques for predicting disease accurately [1]. A computer-based analysis system is, in effect, a mechanized medical diagnosis system. Such a system supports the medical practitioner in making good decisions about disease and treatment [2]. Classification maps data into predefined groups or classes. It is frequently referred to as supervised learning because the classes are determined before the data are examined [3]. A filter-based method selects features without depending on the type of classifier used. The advantage of this method is that it is simple and independent of the classifier, so feature selection needs to be done only once [4].

A. THYROID DISEASE
Thyroid disease (TD) is studied within endocrinology and is considered one of the most common diseases that are frequently misunderstood and misdiagnosed. Thyroid disease is a medical condition that affects the function of the thyroid gland. In general, disorders of the thyroid gland fall into two categories:
• Hyperthyroidism - a condition in which the thyroid produces too much hormone, which makes the body use energy faster than it should.
• Hypothyroidism - a condition in which the thyroid does not produce enough hormone, which makes the body use energy more slowly than it should.
Thyroid disease has a complex relationship with metabolism and body weight and, unless treated properly, can lead to serious complications such as decreased taste, decreased sense of smell, memory loss and depression. Thus, early and correct diagnosis of this disease is an important task in medical diagnosis.

B. THYROID DISEASE DIAGNOSIS
Proper interpretation of the thyroid data, alongside clinical examination and complementary investigation, is an important issue in the diagnosis of thyroid disease. Doctors can combine numerous factors, including clinical evaluation, blood tests, imaging tests, biopsies, and other tests, to diagnose thyroid disease. A commonly used method is the thyroid-stimulating hormone (TSH) test, which can identify thyroid disorders even before the onset of symptoms. Using CAD systems for diagnosis provides multiple advantages:
• Can minimize the operator-dependent nature inherent in medical imaging systems and can make the diagnostic process reproducible.
• Can help to improve the accuracy of diagnosis.
• Can work with features (such as computational features and statistical features) that cannot be obtained through visual analysis or intuitive examination.

LITERATURE REVIEW
Zhenning Wu et al. [5] have proposed a PIM-clustering-based FSVM algorithm for classification problems with outliers or noise. Experiments were conducted on five benchmark datasets to test the generalization performance of the PIM-FSVM algorithm. Their results show that the PIM-FSVM algorithm produces more reasonable memberships and is more robust than the other methods considered in their paper for classification problems with outliers or noise. They also present the computational complexity of the PIM-FSVM algorithm, which is no higher, and in some cases lower, than that of the other methods.
Zhiquan Qi et al. [6] have proposed a new Structural Twin Support Vector Machine (called S-TWSVM), which is sensitive to the structure of the data distribution. They first pointed out the shortcomings of existing algorithms based on structural information, then designed the new S-TWSVM algorithm and analyzed its advantages and its relationships with other algorithms. Their theoretical analysis and experimental results show that S-TWSVM can more fully exploit this prior structural information to improve classification accuracy.
Himanshu Rai et al. [7] have introduced a novel and efficient approach for iris feature extraction and recognition. They compared recognition accuracy against previously reported approaches and reported a better recognition rate than that obtained using SVM or Hamming distance alone. They claim an increase in efficiency when separate feature extraction techniques are used for the SVM and the Hamming distance based classifiers, and report that the accuracy of the proposed method is excellent for the CASIA as well as the Chek image database in terms of FAR and FRR.
Zuriani Mustaffa et al. [8] have reported empirical results that examine the feasibility of eABC-LSSVM in predicting prices of the time series of interest. The performance of their proposed prediction model was evaluated using four statistical metrics, namely MAPE, PA, SMAPE and RMSPE, and tested on three different data arrangements in order to choose the best arrangement for generalization purposes. In addition, the proposed technique proved capable of avoiding premature convergence, which leads to good generalization performance.
Khyati K. Gandhi and Nilesh B. Prajapati [9] applied feature selection techniques in 2014 to a diabetes dataset (Pima Indian diabetic database) from the UCI repository. F-score, ReliefF and Genetic Algorithm were used for feature selection on the diabetes dataset, and classification was then performed using a Support Vector Machine classifier. Their analysis found that the performance of SVM improves most when the F-score technique is used on the diabetes dataset; the accuracy achieved with F-score is higher than with the other methods. The accuracy of the Genetic Algorithm was analyzed using Support Vector Machine as well as Artificial Neural Networks, and the results showed that the accuracy is higher with SVM.
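For illustration, an F-score filter of the kind mentioned above can be sketched as follows. This is a minimal sketch, assuming the two-class Fisher-style F-score that is commonly paired with SVM (separation of the class means from the overall mean, divided by the sum of the within-class sample variances); [9] may define or normalize the score differently. The function name `f_score` and the toy values are hypothetical and are not taken from the Pima dataset.

```python
def f_score(pos, neg):
    """Fisher-style F-score of a single feature.

    pos, neg: feature values for the positive and negative class.
    Each class needs at least two samples (sample variance is used).
    """
    n_pos, n_neg = len(pos), len(neg)
    mean_pos = sum(pos) / n_pos
    mean_neg = sum(neg) / n_neg
    mean_all = (sum(pos) + sum(neg)) / (n_pos + n_neg)

    # Numerator: how far each class mean lies from the overall mean.
    between = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2

    # Denominator: sum of the within-class sample variances.
    var_pos = sum((v - mean_pos) ** 2 for v in pos) / (n_pos - 1)
    var_neg = sum((v - mean_neg) ** 2 for v in neg) / (n_neg - 1)
    within = var_pos + var_neg

    return between / within if within else float("inf")


# Toy values (made up): the first feature separates the classes,
# the second has identical class means and so carries no signal.
strong = f_score([1.0, 2.0, 3.0], [7.0, 8.0, 9.0])
weak = f_score([1.0, 5.0, 9.0], [2.0, 5.0, 8.0])
print(strong, weak)
```

Features are then ranked by this score and the top-ranked ones are passed to the classifier; because the score is computed from the data alone, the ranking does not depend on which classifier is used afterwards.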
Xiaobo Li et al. [10] presented in 2011 a comparison of seven different feature selection techniques on a multiclass cancer dataset. The seven feature selection methods are Correlation based, Chi-Squared, Gain Ratio, Information Gain, ReliefF, SVM-RFE and Symmetrical Uncertainty. The experimental results show that feature selection using SVM-RFE gives better performance than the other six methods. Feature selection on multiclass cancer data is critical, but better accuracy can be achieved on the dataset by using proper feature selection and classification methods. show that the K-means and SVM hybrid model reduces the time required for prediction with a higher rate of accuracy.
Esin Dogantekin [12] [13] has proposed two hybrid methods for thyroid disease diagnosis. One method is based on principal component analysis and least square support vector machine and achieved 97.67% accuracy. The other method is based on Generalized Discriminant Analysis and wavelet support vector machine and achieved 91.86% accuracy. For both studies, the thyroid dataset was downloaded from the UCI machine learning repository.
S. Yasodha et al. [14] have proposed the CACC-SVM technique, which is a hybridization of the Class-Attribute Contingency Coefficient (CACC) and support vector machine (SVM), for classification of thyroid data. The proposed model achieved better accuracy than other traditional models.
Nikita Singh and Alka Jindal [15] have concluded that SVM is a better classifier than KNN and Bayesian; the accuracy of SVM is about 84.62%. KNN finds the nearest neighbors automatically and is represented as a graph in which each vertex holds an object. The Bayesian classifier is based on probability and gives the probability that a sample belongs to a class.

METHODOLOGY
Filter-based algorithms rely on general characteristics of the data to evaluate and select feature subsets without involving any mining algorithm.
• Subset feature quality is improved by using two filter criteria instead of one, as in conventional methods.
• Discrepancies are minimized, which increases performance during thyroid disease classification.
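The idea of combining two filter criteria can be illustrated with a small sketch. The text does not name the two criteria, so the ones used here are assumptions chosen for illustration only: relevance (absolute Pearson correlation between a feature and the class label) and redundancy (absolute Pearson correlation between a candidate and an already selected feature). The function names, the thresholds `min_relevance` and `max_redundancy`, and the toy values are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def two_criteria_filter(columns, labels, min_relevance=0.3, max_redundancy=0.9):
    """Select feature names using a relevance filter followed by a redundancy filter.

    columns: dict mapping feature name -> list of numeric values
    labels:  list of numeric class labels, same length as each column
    """
    # Criterion 1 (assumed): keep only features correlated with the class label.
    relevance = {name: abs(pearson(col, labels)) for name, col in columns.items()}
    candidates = sorted(
        (name for name in columns if relevance[name] >= min_relevance),
        key=lambda name: relevance[name],
        reverse=True,
    )

    # Criterion 2 (assumed): skip a candidate that nearly duplicates a kept feature.
    selected = []
    for name in candidates:
        if all(
            abs(pearson(columns[name], columns[kept])) <= max_redundancy
            for kept in selected
        ):
            selected.append(name)
    return selected


# Toy data (made up, not from the UCI thyroid dataset).
columns = {
    "tsh": [0.5, 0.7, 6.0, 7.5, 0.6, 8.0],
    "tsh_copy": [1.0, 1.4, 12.0, 15.0, 1.2, 16.0],  # scaled duplicate of "tsh"
    "noise": [3.0, 1.0, 3.0, 1.0, 2.0, 2.0],        # uncorrelated with the label
}
labels = [0, 0, 1, 1, 0, 1]
selected = two_criteria_filter(columns, labels)
print(selected)
```

In this toy run the uninformative feature fails the relevance check and only one of the two duplicated features survives the redundancy check. No classifier is involved at any point, which is what makes the procedure a filter method.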

RESULTS AND DISCUSSION
In the experiments discussed, the performance of the feature selection algorithm was analyzed using three frequently used classifiers, namely BPNN (Back Propagation Neural Network), KNN (K Nearest Neighbor) and SVM (Support Vector Machine). From the results, it is clear that SVM produces high accuracy; hence, the next research work is planned to improve the working of SVM.

CONCLUSION
The data are taken from the UCI Thyroid dataset, which has 7200 instances and 21 attributes. The performance metrics used are Accuracy, Sensitivity, Speed and Specificity. The experiment reveals that the proposed IMFFS algorithm is more efficient than the conventional algorithm in terms of all the selected performance metrics. This indicates that the algorithm is able to remove the maximum amount of redundant data while preserving the relevant (or important) data.
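As an illustration of how accuracy, sensitivity and specificity can be computed for a three-class problem (hyper, hypo, normal), the sketch below uses the standard confusion-matrix definitions in a one-vs-rest fashion. It is a minimal sketch under stated assumptions, not the evaluation code used in the experiments: the function names and the toy label lists are hypothetical, and treating each class in turn as the "positive" class is one common convention among several.

```python
def overall_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class equals the true class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def one_vs_rest_metrics(y_true, y_pred, positive):
    """Accuracy, sensitivity and specificity with `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Sensitivity: share of true positives that were detected.
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        # Specificity: share of true negatives that were correctly rejected.
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


# Toy labels (made up, not experimental results).
y_true = ["hyper", "hypo", "normal", "normal", "hypo", "normal", "hyper", "normal"]
y_pred = ["hyper", "normal", "normal", "normal", "hypo", "hyper", "hyper", "normal"]

acc = overall_accuracy(y_true, y_pred)
hyper = one_vs_rest_metrics(y_true, y_pred, positive="hyper")
print(acc, hyper)
```

Reporting sensitivity and specificity per class alongside overall accuracy is useful when the classes are unevenly represented, since a classifier can score high accuracy while missing most cases of a rare class.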