A NEW STAD MODEL TO PREDICT THE DIABETES MELLITUS

Abstract: Diabetes mellitus is a metabolic disorder caused by deficient insulin secretion or action and is characterized by hyperglycemia. The persistent hyperglycemia of diabetes leads to damage, malfunction and failure of various organs such as the kidneys, eyes, nerves, blood vessels and heart. Detecting and diagnosing diabetes at an early stage is a pressing need. Diagnosing diabetes and interpreting diabetes data is an important classification problem. A variety of data mining techniques are used to discover new patterns of disease and promote the early detection and diagnosis of complex diseases such as diabetes; rule extraction is one of them. Rules are extracted from the dataset, and the extracted rules should be not only highly accurate but also simple and easy to understand. Therefore, in this study, a rule extraction algorithm, the Enhanced STAD model, is proposed to obtain highly accurate, concise, and interpretable classification rules for the Pima Indian Diabetes (PID) dataset, which comprises 768 samples with two classes (diabetes or non-diabetes) and eight attributes. An advanced decision tree is generated and used for classification. The STAD model achieved substantially better accuracy and produced considerably fewer rules and antecedents on average. These results suggest that the proposed algorithm is more suitable for medical decision making, including the diagnosis of all types of diabetes mellitus.


I. INTRODUCTION
Diabetes is often called a disease of modern society. Lack of regular exercise and rising obesity rates are among the main contributing factors. It is a very serious disease that, if not treated properly and on time, can lead to severe complications, including death [1]. Detecting and diagnosing diabetes at an early stage is a pressing need. Diagnosing diabetes and interpreting diabetes data is an important classification problem [2]. The data classification problem is studied by statisticians and machine learning researchers. Data classification is widely used in a variety of engineering and scientific disciplines such as biology, psychology, medicine, marketing, computer vision, and artificial intelligence [3]. The goal of data classification is to assign objects to a number of categories or classes: for a given dataset, the task is to assign a class to each data object. In 2011 there were 347 million diabetics worldwide, and by 2030 this number is expected to increase to 552 million. About 4.6 million deaths were caused by diabetes in 2011, and by 2030 it is projected to be the seventh leading cause of death [4]. According to the Centers for Disease Control and Prevention, an estimated 29.1 million people, or 9.3% of the US population, have diabetes [5], 8.1 million of whom remain undiagnosed. In 2010, diabetes was listed as the underlying cause of death on 90,000 death certificates and as a cause of death on another 3,44,525, making it the fourth leading cause of death in India [6]. Type 2 diabetes mellitus, previously known as non-insulin-dependent diabetes mellitus or adult-onset diabetes, typically has a later peak age of onset than type 1 diabetes and accounts for about 80-90% of all diagnosed adult cases of diabetes [7].
Type 2 diabetes mellitus usually starts with insulin resistance, a disorder in which cells, primarily within the muscles, liver and fat tissue, do not utilize insulin properly. The beta cells in the pancreas gradually lose the ability to produce sufficient quantities of insulin as the need for the hormone increases [8]. Some individuals have primarily insulin resistance and only a minor defect in insulin secretion, whereas others have primarily a defect in insulin secretion and only slight insulin resistance. An increasing amount of data is being collected in medical databases, and historical data on complex diseases, such as patients' blood glucose levels, is becoming more widely available; traditional methods of manual analysis have therefore become inadequate [9]. As a result, a variety of data mining techniques are being applied to discover new patterns of disease and promote the early detection and diagnosis of complex diseases such as diabetes [10]. In this study, the Stipulation Technique with Advanced Decision tree (STAD) is applied for rule extraction. It is tested on the Pima Indian Diabetes (PID) dataset [11]. Environmental attributes such as heredity and lifestyle are also considered in this study. The proposed STAD model was observed to give better results with respect to accuracy.

II. LITERATURE REVIEW
According to Knowler et al., the Pima Indians, whose records make up the Pima Indian Diabetes (PID) dataset, have the highest reported incidence of diabetes in the world. Smith et al. used this dataset to test a model for predicting the onset of diabetes mellitus; the study was designed to find the relationship between the onset of diabetes mellitus and previous risk factors for diabetes among the Pima Indians [12], [15]. In 2012, Shanker [13] evaluated the effectiveness of artificial neural network (NN) classifiers in predicting the onset of non-insulin-dependent diabetes mellitus among the Pima Indian female population [14]. A study on semi-supervised fuzzy classification was conducted by Lekkas and Mikhailov [16] for the diagnosis of two medical problems. In their system, two domains containing records of actual patients with known diagnoses were used.
They proposed a new evolutionary approach for deriving compact fuzzy classification systems directly from the data, without any prior knowledge or assumptions about the distribution of the data [17]. Fuzzy membership functions are assigned to fuzzy variables; rules and membership functions are then automatically created and optimized in an evolutionary process. Rabybak et al. recently proposed a rule extraction algorithm that works on discrete and continuous datasets. The algorithm applies genetic programming to generate a syntactic tree representing a set of rules that mimics the functioning of the tree. The objective of the Re-Rx (Recursive-Rule Extraction) algorithm is to obtain highly accurate, concise and interpretable classification rules for the PID dataset. The most important aim of Re-Rx is to improve the conciseness and interpretability of the extracted rules for physicians, rather than to compete only for better classification accuracy on the PID dataset. The existing Re-Rx algorithm is used to extract a set of concise and interpretable diagnostic rules for the PID dataset. Re-Rx extracts more rules than the proposed model.

III. THE PROPOSED STIPULATION TECHNIQUE WITH ADVANCED DECISION TREE (STAD) MODEL
The STAD model extracts if-then rules directly from the training data using the advanced decision tree. The rules are learned from the decision tree, and each rule for a given class should ideally cover many of that class's tuples. Rules are learned one at a time: each time a rule is learned, the tuples covered by the rule are removed and the process repeats on the remaining tuples. Because the rules are learned one at a time, each learned rule tends to have high accuracy.
The rules need not necessarily have high coverage. The process continues until a terminating condition is met, for example when there are no more training tuples or the quality of the returned rule falls below a user-specified threshold. The learn-one-rule procedure finds the best rule for the current class given the current set of training tuples.
Typically, rules are grown in a general-to-specific manner. The technique appends an attribute test as a logical conjunct to the existing condition of the rule antecedent. Consider the Pima Indian Diabetes data as the training set. Attributes recorded for each patient include body mass index (BMI), oral glucose tolerance test (OGTT) result, and diastolic blood pressure (DBP). The classifying attribute is BMI level, which indicates whether the patient is diabetic or non-diabetic. To start with, the rule antecedent is empty; the other attributes are then incorporated gradually. For example, in this study BMI, OGTT and DBP are considered as attributes for detecting diabetes.
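
The rule-growing procedure described in this section can be illustrated with a minimal Python sketch of generic sequential covering. It is not the authors' implementation: the function names, the candidate thresholds (BMI >= 30, OGTT >= 140, DBP >= 90), the 0.8 quality threshold and the six toy records are all hypothetical and chosen only to show how a rule starts with an empty antecedent, gains one attribute test at a time, and how covered tuples are removed before the next rule is learned.

```python
from typing import Callable, Dict, List, Tuple

# One attribute test: (attribute name, predicate on its value, readable text).
Test = Tuple[str, Callable[[float], bool], str]

def covers(rule: List[Test], row: Dict) -> bool:
    """A rule covers a row when every test in its antecedent holds."""
    return all(pred(row[attr]) for attr, pred, _ in rule)

def rule_accuracy(rule: List[Test], rows: List[Dict], target: str) -> float:
    """Fraction of covered rows that belong to the target class."""
    covered = [r for r in rows if covers(rule, r)]
    if not covered:
        return 0.0
    return sum(r["label"] == target for r in covered) / len(covered)

def learn_one_rule(rows, target, candidates):
    """Grow one rule general-to-specific, starting from an empty antecedent."""
    rule: List[Test] = []
    best = rule_accuracy(rule, rows, target)
    while True:
        unused = [t for t in candidates if t not in rule]
        if not unused:
            break
        scored = [(rule_accuracy(rule + [t], rows, target), t) for t in unused]
        acc, test = max(scored, key=lambda pair: pair[0])
        if acc <= best:  # no remaining test improves the rule
            break
        rule, best = rule + [test], acc
    return rule, best

def sequential_covering(rows, target, candidates, min_quality=0.8):
    """Learn rules one at a time, removing the tuples each new rule covers."""
    remaining, rules = list(rows), []
    while any(r["label"] == target for r in remaining):
        rule, quality = learn_one_rule(remaining, target, candidates)
        if not rule or quality < min_quality:  # terminating condition
            break
        rules.append(rule)
        remaining = [r for r in remaining if not covers(rule, r)]
    return rules

def rule_text(rule: List[Test]) -> str:
    return " AND ".join(text for _, _, text in rule)

# Hypothetical thresholds and toy records, for illustration only.
CANDIDATES = [
    ("BMI", lambda v: v >= 30, "BMI >= 30"),
    ("OGTT", lambda v: v >= 140, "OGTT >= 140"),
    ("DBP", lambda v: v >= 90, "DBP >= 90"),
]
ROWS = [
    {"BMI": 35, "OGTT": 160, "DBP": 95, "label": "diabetes"},
    {"BMI": 32, "OGTT": 150, "DBP": 80, "label": "diabetes"},
    {"BMI": 24, "OGTT": 145, "DBP": 70, "label": "non-diabetes"},
    {"BMI": 31, "OGTT": 110, "DBP": 85, "label": "non-diabetes"},
    {"BMI": 22, "OGTT": 100, "DBP": 75, "label": "non-diabetes"},
    {"BMI": 28, "OGTT": 170, "DBP": 92, "label": "diabetes"},
]

rules = sequential_covering(ROWS, "diabetes", CANDIDATES)
for r in rules:
    print("IF", rule_text(r), "THEN diabetes")
```

On the toy records the sketch learns a one-test rule first, removes the two tuples it covers, and then needs two conjoined tests to separate the remaining diabetic tuple. Real data would require a search over thresholds rather than a fixed candidate list.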

A. Stipulation Technique with Advanced Decision tree (STAD) Algorithm
Step 1: Train and prune a neural network (NN) using the dataset S and all of its discrete (D) and continuous (C) attributes.
Step 2: Let D' and C' be the sets of discrete and continuous attributes, respectively, still present in the network, and let S' be the set of data samples correctly classified by the pruned network.
Step 3: Generate a decision tree using both the discrete D' and continuous C' attributes.
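
Steps 2 and 3 can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the authors' code: the pruned network of Step 1 is replaced by a hypothetical stand-in function `net_predict`, the tree is a plain Gini-based binary tree rather than the paper's advanced decision tree, and the attribute `family_history` and all records are invented for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, attrs):
    """Return the (attribute, threshold) that lowers weighted Gini most, or None."""
    best, best_score = None, gini(labels)
    for a in attrs:
        for t in sorted({r[a] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[a] <= t]
            right = [y for r, y in zip(rows, labels) if r[a] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (a, t), score
    return best

def build_tree(rows, labels, attrs, depth=0, max_depth=3):
    """Nodes are (attr, threshold, low_branch, high_branch); leaves are class labels."""
    split = best_split(rows, labels, attrs) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # majority class
    a, t = split
    low = [(r, y) for r, y in zip(rows, labels) if r[a] <= t]
    high = [(r, y) for r, y in zip(rows, labels) if r[a] > t]
    return (a, t,
            build_tree([r for r, _ in low], [y for _, y in low], attrs, depth + 1, max_depth),
            build_tree([r for r, _ in high], [y for _, y in high], attrs, depth + 1, max_depth))

def predict(tree, row):
    while isinstance(tree, tuple):
        a, t, low, high = tree
        tree = low if row[a] <= t else high
    return tree

def stad_steps_2_3(rows, labels, net_predict, kept_attrs):
    # Step 2: S' is the set of samples the pruned network classifies correctly.
    kept = [(r, y) for r, y in zip(rows, labels) if net_predict(r) == y]
    if not kept:
        raise ValueError("the network classified no sample correctly")
    # Step 3: grow a tree on S' using the surviving attributes D' and C'.
    tree = build_tree([r for r, _ in kept], [y for _, y in kept], kept_attrs)
    return tree, len(kept)

# Hypothetical stand-in for the trained and pruned network of Step 1.
def net_predict(row):
    return "yes" if row["OGTT"] >= 140 else "no"

# Toy records, for illustration only; family_history is an assumed discrete attribute.
ROWS = [
    {"family_history": 1, "OGTT": 160},
    {"family_history": 0, "OGTT": 150},
    {"family_history": 0, "OGTT": 100},
    {"family_history": 1, "OGTT": 120},
    {"family_history": 0, "OGTT": 145},
    {"family_history": 1, "OGTT": 130},
]
LABELS = ["yes", "yes", "no", "no", "no", "yes"]

tree, n_kept = stad_steps_2_3(ROWS, LABELS, net_predict, ["family_history", "OGTT"])
print(tree, n_kept)
```

Discrete attributes are coded as numbers here so that a single threshold test handles both attribute types; each root-to-leaf path of the resulting tree can be read as one if-then rule.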

B. Confusion Matrix
When the classifier predicts that a person has diabetes, the predicted class is "yes"; when it predicts that the person does not have diabetes, the predicted class is "no". The classifier made a total of 768 predictions (i.e., 768 patients were tested for the presence of the disease). The classifier predicted "yes" 532 times and "no" 236 times. In reality, 105 patients in the sample have the disease, and 60 patients do not have diabetes.
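
How the four cells of a confusion matrix and the usual summary measures are obtained from actual and predicted classes can be shown with a short Python sketch. The eight label pairs below are hypothetical and are not the paper's results; the function names are likewise chosen only for this illustration.

```python
def confusion_matrix(actual, predicted, positive="yes"):
    """Count true/false positives and negatives for a two-class problem."""
    cm = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            cm["TP" if a == positive else "FP"] += 1
        else:
            cm["FN" if a == positive else "TN"] += 1
    return cm

def summary(cm):
    """Accuracy, sensitivity, specificity and precision from the four counts."""
    def ratio(num, den):
        return num / den if den else 0.0
    total = sum(cm.values())
    return {
        "accuracy": ratio(cm["TP"] + cm["TN"], total),
        "sensitivity": ratio(cm["TP"], cm["TP"] + cm["FN"]),
        "specificity": ratio(cm["TN"], cm["TN"] + cm["FP"]),
        "precision": ratio(cm["TP"], cm["TP"] + cm["FP"]),
    }

# Hypothetical labels, for illustration only.
actual = ["yes", "yes", "yes", "no", "no", "no", "no", "yes"]
predicted = ["yes", "yes", "no", "no", "yes", "no", "no", "yes"]

cm = confusion_matrix(actual, predicted)
stats = summary(cm)
print(cm)
print(stats)
```

The predicted "yes" count equals TP + FP and the number of patients who actually have the disease equals TP + FN, so both totals must sum, together with their "no" counterparts, to the number of predictions.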

D. Histogram representation of STAD model compared with regular covering technique
In this representation, the STAD model is compared with the regular covering technique. In multi-objective optimization and economics, Pareto optimality is always an important issue. In medical rule extraction there is a trade-off between high diagnostic accuracy and the interpretability of the extracted rules. Physicians may prefer extracted diagnostic rules that give up some accuracy in exchange for greater interpretability. If the optimal solution can be found, the best extracted rules can be obtained. Ideally, to extend the optimal solution to a wider viable region that improves both diagnostic accuracy and interpretability, the rule extraction technique is used to find a compromise between the two requirements by building a simple rule set that mimics how the well-performing complex model makes decisions. The comparative analysis of the regular covering technique and the STAD model for the PID dataset is shown in Figure 4.
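
The accuracy versus interpretability trade-off can be made concrete with a small Python sketch that selects the Pareto-optimal rule sets from a list of candidates. It assumes that interpretability is measured simply by the number of rules (fewer is better); the candidate names, accuracies and rule counts are hypothetical and do not come from the experiments reported here.

```python
def pareto_front(candidates):
    """Return names of candidates not dominated by any other candidate.

    Each candidate is (name, accuracy, n_rules). One candidate dominates
    another when it is at least as accurate and has no more rules, and is
    strictly better on at least one of the two criteria.
    """
    front = []
    for name, acc, n_rules in candidates:
        dominated = any(
            other_acc >= acc and other_n <= n_rules
            and (other_acc > acc or other_n < n_rules)
            for _, other_acc, other_n in candidates
        )
        if not dominated:
            front.append(name)
    return front

# Hypothetical rule sets, for illustration only.
CANDIDATE_RULE_SETS = [
    ("rule_set_1", 0.80, 12),
    ("rule_set_2", 0.78, 5),
    ("rule_set_3", 0.75, 9),
    ("rule_set_4", 0.82, 20),
    ("rule_set_5", 0.70, 3),
]

front = pareto_front(CANDIDATE_RULE_SETS)
print(front)
```

Here `rule_set_3` is dropped because `rule_set_2` is both more accurate and smaller; every remaining rule set represents a different compromise a physician could choose between.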

V. CONCLUSION
The STAD model is more accurate, concise and interpretable, and is therefore more suitable for medical decision making. High accuracy, conciseness and interpretability are achieved simultaneously by the proposed STAD model. The STAD model is expected to be particularly useful for patients with diabetes mellitus whose fracture risk is relatively high. The diagnosis of diabetes mellitus remains a complex problem; the STAD model should therefore be tested on more recent and more complete diabetes datasets in future studies to ensure that the most accurate rules can be extracted for diagnosis.