EMPIRICAL EVALUATION OF MACHINE LEARNING ALGORITHMS FOR AUTOMATIC DOCUMENT CLASSIFICATION

Abstract: Automatic document classification is an important research area in the field of Text Mining (TM). Text mining is the process of discovering interesting patterns or knowledge from large amounts of data. Document classification is used in many domains; here, the classification process is applied to SMS spam classification. A benchmark dataset is processed with several machine learning (ML) algorithms: Naïve Bayes, Support Vector Machine, Decision Tree and Logistic Regression. This paper evaluates the results of these machine learning algorithms for automatic document classification on the SMS spam classification task.


I. INTRODUCTION
Automatic text document classification is one of the prime functionalities in the field of Text Mining, owing to the exponential growth of unstructured data in the current digital era. The primary objective of classification is to automatically assign each document a predefined label based on its contents. It is widely used for knowledge extraction and knowledge representation in text data sets. Well-known applications that employ document classification are email categorization, spam filtering, directory maintenance, mail routing, news monitoring and narrowcasting. In general, the text document classification process includes two major phases, namely document representation and classification. The document representation phase is divided into two steps: feature extraction and feature selection. Feature extraction involves various preprocessing activities that reduce document complexity and make the classification process easier. Usually, preprocessing incorporates stop word removal, stemming, punctuation removal and, finally, tokenization. Feature extraction then includes the calculation of Term Frequency (TF) and Inverse Document Frequency (IDF) over the tokenized documents. Finally, all document vectors are normalized to unit length. The second phase of document classification is the application of machine learning algorithms. Many machine learning algorithms are available, spanning supervised, semi-supervised and unsupervised learning. This paper focuses on supervised machine learning algorithms, namely Naive Bayes (NB), Support Vector Machine (SVM), K-Nearest Neighbor (K-NN), Decision Tree (DT) and Logistic Regression (LR), for automating text document classification. The rest of the paper is organized as follows. Section 2 discusses the machine learning algorithms used for the classification process.
This is followed in Section 3 by experiments on the SMS spam classification task. Finally, Section 4 concludes the paper.
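As an illustration of the document representation phase described above (stop word removal, tokenization, TF and IDF calculation, and unit-length normalization), the pipeline can be sketched in plain Python. This is only a minimal sketch, not the paper's actual implementation; the tiny stop word list is an assumption for the example.

```python
import math
import re

# Assumption: a tiny illustrative stop word list; real systems use larger lists.
STOP_WORDS = {"the", "is", "a", "to", "and", "at"}

def tokenize(text):
    # Lowercase, strip punctuation, split into words, drop stop words.
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def tf_idf(docs):
    """Return one sparse TF-IDF vector (dict term -> weight) per document,
    normalized to unit length as described in the text."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = {}
    for toks in tokenized:
        for t in set(toks):
            df[t] = df.get(t, 0) + 1
    vectors = []
    for toks in tokenized:
        tf = {t: toks.count(t) / len(toks) for t in set(toks)}
        vec = {t: tf[t] * math.log(n / df[t]) for t in tf}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors.append({t: v / norm for t, v in vec.items()})
    return vectors
```

Terms that occur in every document receive an IDF of zero, so only discriminative terms carry weight into the classification phase.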

II. DIFFERENT TYPES OF APPROACHES
Ethem Alpaydin defines Machine Learning (ML) as a paradigm that "optimize a performance criterion using example data or past experience" [7]. Machine learning lies at the intersection of computer science, engineering and statistics, and often appears in other disciplines. It uses statistics to solve many classification and clustering problems. ML algorithms are classified into three categories: supervised, unsupervised and semi-supervised. We now discuss a few machine learning algorithms: Naïve Bayes (NB), Support Vector Machine (SVM), K-Nearest Neighbor (K-NN), Decision Tree (DT) and Logistic Regression (LR).

A. Naïve Bayes Classification:
The Naive Bayes (NB) classifier is a classical probabilistic classifier and a supervised learning technique of ML. It supports both numeric and textual data [2], [6], [13], [17], [20]. NB is widely applied to text document classification and to many application areas such as spam email detection, personal email sorting, document categorization, language recognition and sentiment analysis. Its merits are that it is simple, fast and very effective, it tolerates noisy and missing data values, and it readily provides probability estimates for its predictions. Its demerits stem from the often unrealistic assumption that all features are equally important and independent; it is not ideal for datasets with many numeric features, and its estimated probabilities are less reliable than its predicted classes.
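A minimal multinomial Naive Bayes text classifier can be sketched in a few lines of Python. This is an illustrative sketch under simple assumptions (whitespace tokenization, Laplace smoothing), not the exact classifier used in the experiments.

```python
import math
from collections import Counter

class NaiveBayesText:
    """Multinomial Naive Bayes over word counts, with Laplace smoothing."""

    def fit(self, docs, labels):
        self.classes = set(labels)
        self.priors = {c: labels.count(c) / len(labels) for c in self.classes}
        self.word_counts = {c: Counter() for c in self.classes}
        self.totals = {c: 0 for c in self.classes}
        vocab = set()
        for doc, c in zip(docs, labels):
            for w in doc.split():
                self.word_counts[c][w] += 1
                self.totals[c] += 1
                vocab.add(w)
        self.vocab_size = len(vocab)
        return self

    def predict(self, doc):
        best, best_lp = None, float("-inf")
        for c in self.classes:
            # Work in log space to avoid floating-point underflow.
            lp = math.log(self.priors[c])
            for w in doc.split():
                # Laplace (add-one) smoothing avoids zero probabilities
                # for words unseen in a class.
                lp += math.log((self.word_counts[c][w] + 1)
                               / (self.totals[c] + self.vocab_size))
            if lp > best_lp:
                best, best_lp = c, lp
        return best
```

The independence assumption criticized above is visible in the code: each word contributes its log-probability separately, with no interaction terms.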

B. Support Vector Machines:
The SVM is a statistically based method and a supervised learning technique of ML. It is mainly used to solve regression and categorization problems [3], [18], [19], [23]. An SVM with a sigmoid kernel function behaves like a two-layer perceptron. Given class members as n-dimensional vectors, the SVM discriminates positive from negative examples; the training set contains both positive and negative instances. The method is grounded in computational learning theory and performs structural risk minimization. The advantages of SVM are that it can be used for classification or numeric prediction, it is not overly influenced by noisy data and it is not very prone to overfitting. It is easier to use than artificial neural networks, particularly because many well-supported SVM implementations exist. Its disadvantages are that training is very slow when the input dataset has a huge number of features, the result is represented as a complex black-box model, and finding the best model requires trying various combinations of kernels.
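To make the idea concrete, a linear SVM without a bias term can be trained with Pegasos-style stochastic subgradient descent on the hinge loss. This training method is a choice for the sketch, not necessarily what the paper's experiments used, and it omits kernels entirely.

```python
import random

def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Minimal linear SVM (no bias term) trained with Pegasos-style
    stochastic subgradient descent on the hinge loss.
    X: list of feature vectors, y: labels in {-1, +1}."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    t = 0
    for _ in range(epochs):
        order = list(range(len(X)))
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)                      # decreasing step size
            score = sum(wj * xj for wj, xj in zip(w, X[i]))
            w = [(1.0 - eta * lam) * wj for wj in w]   # regularization shrink
            if y[i] * score < 1.0:                     # inside the margin
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w

def svm_predict(w, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1
```

The structural risk minimization mentioned above appears here as the trade-off between the regularization term (controlled by `lam`) and the hinge loss on the training points.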

C. K-Nearest Neighbour:
The k-NN algorithm is a supervised learning algorithm and a nonparametric method for text categorization [1], [4], [5], [8], [15], [21]. It is a typical instance-based approach: it classifies new cases based on a similarity measure, i.e. by using distance functions such as the Euclidean distance, as in Eq. (1):

Dist(x, y) = sqrt( Σᵢ (xᵢ − yᵢ)² )    (1)

The merits of k-NN are that it is simple and effective, makes no assumptions about the underlying data distribution and is fast in the training phase. Its demerits are that the classification phase is very slow and that missing data in the training set requires special handling.
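The distance function of Eq. (1) and the majority-vote classification it supports can be sketched directly; this is a minimal illustration, with `k=3` as an arbitrary assumption.

```python
import math
from collections import Counter

def euclidean(x, y):
    # Eq. (1): Dist(x, y) = sqrt(sum_i (x_i - y_i)^2)
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_predict(train_X, train_y, query, k=3):
    """Classify by majority vote among the k nearest training points.
    All distance work happens here, which is why classification is slow."""
    neighbours = sorted(zip(train_X, train_y),
                        key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]
```

Note that "training" amounts to storing the data, while every prediction scans the whole training set, matching the merits and demerits listed above.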

D. Decision Trees:
A decision tree models decisions and their possible outcomes, including chance events, resource costs and utility. It is a hierarchical, tree-like structure, an acyclic directed graph, as shown in Figure 2. The starting node is the root node, which connects directly to the nodes of the next level. The final nodes (leaves) represent the document categories; to categorize a document, the tree is traversed from the root down to a leaf [10], [12], [14], [16]. Branches link nodes of adjacent levels, and at each internal node a test is executed on selected document attributes; the test result determines which branch leads to a specific node at the level below. When the connections between specific nodes are emphasized, the tree can be viewed as an influence diagram. The strengths of decision trees are that they form an all-purpose classifier that does well on many problems, support a high degree of automation, accept both numeric and nominal features, cope with missing data and trivial features, and work on both small and huge datasets. Their weaknesses are that splits are often biased towards features with a large number of levels, that large trees easily overfit or underfit the model, and that large trees can be difficult to interpret, with decisions that may seem counterintuitive.
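The attribute test at each internal node is typically chosen to make the child nodes as pure as possible. As a minimal sketch (one split level only, Gini impurity as the purity measure, both of which are assumptions rather than the paper's stated setup), the selection can be written as:

```python
def gini(labels):
    """Gini impurity of a list of class labels: 0 means perfectly pure."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(X, y):
    """Find the (feature index, threshold, score) minimizing the
    size-weighted Gini impurity of the two child nodes."""
    best = (None, None, float("inf"))
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [y[i] for i, row in enumerate(X) if row[f] <= t]
            right = [y[i] for i, row in enumerate(X) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best[2]:
                best = (f, t, score)
    return best
```

A full tree learner applies this search recursively to each child node until the leaves are pure or a stopping criterion is met; growing too deep is exactly the overfitting risk noted above.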

E. Logistic Regression:
Logistic regression is a powerful statistical model that produces a binomial outcome from one or more explanatory variables. It estimates the relationship between the categorical dependent variable and the independent variables. The logistic function, which is the cumulative logistic distribution, is used to estimate probabilities [9], [11], [22].

III. EXPERIMENTAL SETUP
A.

B. Experimental Results:
The outcomes of the experiments are visualized as confusion matrices, which show the relationship between the positive and negative predictions of the class labels under the given experimental design. Figure 3 depicts these properties for the SMS spam classification task. Specifically, a confusion matrix, also called an error matrix, is a distinct table layout that allows visualization of the performance of the applied machine learning algorithms on the benchmark dataset. Various important statistical measures, such as accuracy, error, precision agreement, precision error, kappa statistics, sensitivity, specificity, precision, recall and F-measure, are calculated from the resulting confusion matrix. In our experimental procedure, the positive class is spam, which is the prediction of interest.
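Several of the measures listed above follow directly from the four cells of a binary confusion matrix. As a sketch (with spam as the positive class, per the experimental procedure; the cell values in the usage example are invented for illustration):

```python
def metrics(tp, fp, fn, tn):
    """Statistical measures from a 2x2 confusion matrix,
    where the positive class is spam."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)           # also called sensitivity
    specificity = tn / (tn + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_e = (((tp + fp) / total) * ((tp + fn) / total)
           + ((fn + tn) / total) * ((fp + tn) / total))
    kappa = (accuracy - p_e) / (1 - p_e)
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "specificity": specificity, "f_measure": f_measure, "kappa": kappa}
```

For example, `metrics(40, 10, 5, 45)` (40 spam messages caught, 10 ham flagged as spam, 5 spam missed, 45 ham passed through) gives an accuracy of 0.85 and a kappa of 0.70.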

IV. CONCLUSION
This paper investigated state-of-the-art machine learning algorithms for text document classification. A comparison between them was conducted on the benchmark SMS spam dataset, classifying SMS messages as either ham or spam, using the well-established statistical measures of accuracy, kappa statistics, sensitivity, specificity, precision, recall and F-measure. In a nutshell, the Naïve Bayes statistical classifier shows the best performance in all categories.