IDENTIFYING PATIENTS AT RISK OF BREAST CANCER THROUGH DECISION TREES

Abstract: In this paper, we explore how the C4.5 algorithm can be applied to breast cancer datasets in order to extract and formulate rules for identifying risk factors. For this study, we have used the Wisconsin dataset, which contains 9 attributes related to various cell features and anomalies. We have applied the C4.5 algorithm to that dataset to create a decision tree. From the inferred tree, the rules for identifying the patients at risk have been derived. With a training-set size of 200 patient records, our system was found to have an accuracy of 96.7%.


I. INTRODUCTION
Breast cancer is a type of cancer that develops from breast tissue and is often associated with a lump in the breast, a change in breast shape, the development of red and patchy skin, or fluid emanating from the nipple. The causes of breast cancer have not been fully understood to date. Some genetic factors and some environmental factors are associated with its development. Breast cancer is preliminarily detected by a mammogram and confirmed by a biopsy. When a lesion is detected, typically a breast FNA (Fine Needle Aspiration) is performed. It is a simple procedure similar to drawing blood with a needle. It is used to remove some fluid or cells from a breast lesion or cyst in order to determine the nature of the lesion. The extracted sample is smeared on a glass slide and sent to a pathological laboratory to be examined under a microscope. During examination of the tissue samples, 9 characteristics are usually considered [1]. Each characteristic is assigned a number on a scale from 1 to 10 by the pathologist, where a larger number indicates a greater likelihood of malignancy. No single measurement, however, can be used to determine whether a given sample is benign or malignant.
A decision tree is a decision support tool that describes conditions and possible outcomes in the form of a tree-like graph. Each non-terminal node in the tree represents a test or decision on the data item under consideration. The choice of a branch depends upon the outcome of the test. To classify a particular data item, we start at the root node and follow the outcomes of the tests down the tree until we reach a terminal node (or leaf). A decision is made when a terminal node is reached. Decision trees can also be interpreted as a special form of rule set, characterized by the hierarchical organization of their rules.
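The root-to-leaf classification procedure described above can be illustrated with a minimal sketch. The `Leaf` and `Node` types, the attribute names, and the thresholds below are hypothetical and chosen only for illustration; they are not taken from the tree obtained in this study.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Leaf:
    label: str  # class assigned when this terminal node is reached


@dataclass
class Node:
    attribute: str    # attribute tested at this non-terminal node
    threshold: float  # test is: record[attribute] <= threshold
    left: "Tree"      # branch followed when the test succeeds
    right: "Tree"     # branch followed when the test fails


Tree = Union[Leaf, Node]


def classify(tree: Tree, record: Dict[str, float]) -> str:
    """Start at the root and follow test outcomes until a leaf is reached."""
    while isinstance(tree, Node):
        if record[tree.attribute] <= tree.threshold:
            tree = tree.left
        else:
            tree = tree.right
    return tree.label


# Toy tree with made-up attribute names and thresholds (illustrative only).
toy_tree = Node(
    "cell_size_uniformity", 2.5,
    Leaf("benign"),
    Node("bare_nuclei", 3.5, Leaf("benign"), Leaf("malignant")),
)
```

Each root-to-leaf path of such a tree corresponds to one rule, for example "if cell_size_uniformity > 2.5 and bare_nuclei > 3.5 then malignant" in the toy tree.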
There are many popular algorithms that classify a given dataset and, in the process, construct a decision tree that encodes, in the form of rules, how the classification takes place. ID3 (Iterative Dichotomiser 3) [2] is one such algorithm, developed by Ross Quinlan. It is typically used in machine learning and natural language processing applications. Quinlan subsequently improved this algorithm to create the C4.5 algorithm [3], which is one of the most widely used decision-tree algorithms.
The rest of this paper is organized as follows: in Section II, we discuss related work on predicting breast cancer; in Section III, we present the data used for this study, as well as the methods we have followed; in Section IV, we present and discuss our findings; and we conclude in Section V.

II. RELATED WORK
A lot of work has been done in the field of classification to date. Abbass [4] has used artificial neural networks for cancer diagnosis. Ratanamahatana and Gunopulos [5] have used decision trees for feature selection and have used the Wisconsin dataset. Mangasarian and Wolberg [6] [7], and Bennett and Mangasarian [8] have used linear programming for cancer diagnosis using the same dataset. Bennett et al. [9] have developed an ensemble method of classification that combines labelled and unlabelled data, and have also used the breast cancer dataset for testing their methods. Grąbczewski and Duch [10] have used decision tree forests for classification of breast cancer data.

III. DATA AND METHODS
A. Data
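For this study, we have used the Wisconsin dataset, which contains 9 attributes related to various cell features and anomalies (see Table 1). Some values are missing from the dataset, and hence preprocessing was required before we could feed the data to the decision-tree algorithm.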

B. Data Preprocessing
Since the dataset contains missing values, we have included a preprocessing phase that replaces each missing value with the median of the observed values of the corresponding attribute. The median is a holistic measure that is equal to the middle value in a list of values arranged in either ascending or descending order. If the list is of even length, the median is the arithmetic mean of the two middle values.
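The preprocessing step described above can be sketched as follows. This is an illustrative sketch rather than the implementation used in the study; the function name `impute_with_median` and the use of `None` as the missing-value marker are assumptions made for the example.

```python
from statistics import median


def impute_with_median(rows, missing=None):
    """Return a copy of `rows` with missing entries replaced by the column median.

    `rows` is a list of equal-length lists of numbers; `missing` is the marker
    for an absent value (assumed here to be None).
    """
    n_cols = len(rows[0])
    medians = []
    for c in range(n_cols):
        # The median is taken over the observed (non-missing) values only.
        observed = [row[c] for row in rows if row[c] != missing]
        medians.append(median(observed))
    return [
        [medians[c] if row[c] == missing else row[c] for c in range(n_cols)]
        for row in rows
    ]
```

For a column whose observed values are 1, 3, and 10, a missing entry is replaced by 3; for a column with observed values 1 and 4, it is replaced by 2.5, the mean of the two middle values.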

C. Methods
We have used the C4.5 algorithm for classifying our dataset. The splitting criterion of a node in C4.5 is defined to be the gain ratio. Here, the classification uses entropy and information gain for tree splitting. C4.5 is suitable for handling both categorical and continuous data. For a continuous attribute, a threshold value is fixed, and the records are divided into those whose value is at most the threshold and those whose value exceeds it. The initial step is to calculate the information gain for each attribute. The attribute with the maximum gain is chosen as the root node of the decision tree.
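How a threshold on a continuous attribute might be scored can be sketched as follows. This is a simplified illustration, not the implementation used in the study: the function names are hypothetical, candidate thresholds are assumed to be midpoints between consecutive distinct values, and the threshold is ranked by information gain, with the gain ratio (gain divided by the split information) reported alongside it.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy (in bits) of the class distribution in `labels`."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def split_scores(values, labels, threshold):
    """Return (gain, gain_ratio) for the split value <= threshold / value > threshold."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    total = len(labels)
    remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / total
    gain = entropy(labels) - remainder
    # Split information: entropy of the partition sizes themselves.
    split_info = entropy(["left"] * len(left) + ["right"] * len(right))
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio


def best_threshold(values, labels):
    """Return the candidate threshold with the highest gain, or None if no split exists."""
    distinct = sorted(set(values))
    # Assumption: candidates are midpoints between consecutive distinct values.
    candidates = [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]
    if not candidates:
        return None
    return max(candidates, key=lambda t: split_scores(values, labels, t)[0])
```

Dividing the gain by the split information penalizes splits that produce many small or very uneven subsamples, which is the motivation for preferring the gain ratio over the raw gain.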
A sample S is partitioned as follows:
1. When all records in S belong to the same class, S is assigned to be a leaf of the tree.
2. When S contains no records, S is assigned to be a leaf of the tree.
3. When S contains records belonging to more than one class, S must be partitioned or refined into subsamples. A node for S is assigned to the tree, and child nodes are created under it to hold the subsamples.
There are many ways of testing which attribute should be chosen for partitioning the sample, but the most common test is the test of entropy. The entropy of a sample S is given by:
$$\mathrm{Info}(S) = -\sum_{i=1}^{k} \frac{\mathrm{freq}(C_i, S)}{|S|} \log_2 \frac{\mathrm{freq}(C_i, S)}{|S|} \tag{1}$$
where k is the number of classes (in our case, k = 2), |S| represents the number of records in sample S, and freq(C_i, S) represents the number of records in S belonging to class C_i. After S has been partitioned into subsamples S_1, ..., S_n based on the n possible outcomes (values) of an attribute X, we compute the following:
$$\mathrm{Info}_X(S) = \sum_{j=1}^{n} \frac{|S_j|}{|S|} \, \mathrm{Info}(S_j) \tag{2}$$
$$\mathrm{Gain}(X) = \mathrm{Info}(S) - \mathrm{Info}_X(S) \tag{3}$$
The attribute X having the highest Gain value is selected as the partitioning attribute. The process is repeated for every subsample associated with each node, until every subsample contains records of the same class.
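The partitioning procedure and the entropy-based gain computation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: records are dictionaries, attributes are treated as discrete with one branch per observed value, an empty sample becomes a leaf labelled with the parent's majority class, and splitting also stops when no attributes remain. All function names are hypothetical.

```python
from collections import Counter
from math import log2


def info(labels):
    """Entropy of a sample, given the class labels of its records."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def partition(records, attribute):
    """Group records into subsamples, one per observed value of `attribute`."""
    subsamples = {}
    for record in records:
        subsamples.setdefault(record[attribute], []).append(record)
    return subsamples


def gain(records, attribute, target):
    """Entropy of the sample minus the weighted entropy of its subsamples."""
    labels = [r[target] for r in records]
    info_x = sum(
        len(sub) / len(records) * info([r[target] for r in sub])
        for sub in partition(records, attribute).values()
    )
    return info(labels) - info_x


def build_tree(records, attributes, target, default=None):
    """Return a class label (leaf) or a nested dict {attribute: {value: subtree}}."""
    if not records:
        # Empty sample: leaf. Assumption: labelled with the parent's majority class.
        return default
    labels = [r[target] for r in records]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attributes:
        # Pure sample (or, as an added guard, no attributes left): leaf.
        return majority
    # Mixed sample: split on the attribute with the highest gain and recurse.
    best = max(attributes, key=lambda a: gain(records, a, target))
    remaining = [a for a in attributes if a != best]
    return {
        best: {
            value: build_tree(sub, remaining, target, majority)
            for value, sub in partition(records, best).items()
        }
    }
```

On a toy sample in which one attribute separates the two classes perfectly and another carries no information, the first attribute has a gain of 1 bit, the second has a gain of 0, and the resulting tree has a single test at the root.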

IV. RESULTS
We have implemented the algorithm in Java 7 and have tested it using a sample size of 200. The accuracy, computed as shown in Eqn. 4, was found to be 96.71%. In Figure 1, we present the decision tree obtained using the C4.5 algorithm on the sample.
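As a rough illustration of how such an accuracy figure is typically obtained, the sketch below assumes the usual definition of accuracy as the fraction of records whose predicted class matches the actual class. The function name and the toy labels are hypothetical and do not reproduce the results reported here.

```python
def accuracy(actual, predicted):
    """Fraction of records whose predicted class equals the actual class."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one record is required")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Toy example: three of four records are classified correctly.
toy_accuracy = accuracy(
    ["benign", "malignant", "benign", "malignant"],
    ["benign", "malignant", "malignant", "malignant"],
)
```

Multiplying the returned fraction by 100 gives the accuracy as a percentage.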

V. CONCLUSION
In this study, we explored how successfully decision trees can be used to diagnose breast cancer from breast FNAC results. We showed that the C4.5 algorithm, when used with cancer datasets like the Wisconsin dataset, can produce extremely high accuracy rates. Breast cancer takes thousands of lives, with an estimated 533,600 deaths occurring in 2015. Using decision-tree-based classification systems would result in better diagnosis at an early stage for patients who are potentially at risk of breast cancer, which in turn would help save thousands of lives each year.