A COMPARATIVE STUDY OF ASSOCIATION RULE MINING TECHNIQUES AND PREDICTIVE MINING APPROACHES FOR ASSOCIATION CLASSIFICATION

Association Rule Mining (ARM) and classification are integrated to build competitive classifier models called associative classifiers; this approach is known as Association Classification (AC). AC yields accurate classifiers consisting of significant rules capable of predicting the class of the data. This paper traces the evolution from ARM to AC, highlighting the development and improvement of ARM techniques followed by AC techniques. Its goal is to survey and understand different ARM and AC techniques and to compare their performance. A variety of AC algorithms have been proposed in the literature, such as CBA, CMAR, MCAR and CPAR, each adopting a different approach to rule learning in the initial stages. The paper also presents the importance of rule pruning, with a brief survey of the pruning methods discussed in the literature, and reviews the learning approaches adopted by different AC techniques in different domains.


INTRODUCTION
ARM is an emerging research area in data mining that aims at extracting valuable information from huge volumes of data and applying it in decision making. Today organizations are drowning in data but starving for knowledge. With the advent of the internet, electronic data is growing at an extremely high pace, making it difficult for competitors to position themselves in worldwide trading [1]. Data mining technology helps in extracting patterns to understand the data, produce knowledge and use it for future predictions. Data mining can be classified into two categories, namely descriptive mining and predictive mining [1]. Descriptive mining refers to the summarization and characterization of data in the repository. Predictive mining draws inferences from existing data to predict future trends based on past data. ARM is a descriptive mining technique used to generate associations among the items in a transactional or relational dataset [2]. Using ARM to build classification models (classifiers) results in a distinct approach called Association Classification (AC). AC was introduced in 1997 for developing relationships between attribute values [3]. Association Classification is the integration of association rule mining and classification. The classification process maps a group of attributes to a class, and this mapping can be used to assign classes to new data objects based on their attribute values. The classifiers produced by AC techniques are considered to be more accurate than traditional classification approaches such as decision trees. AC approaches have been successful in real-world applications in different domains such as academics, medical diagnosis and web filtering. A number of AC techniques are used in the literature, such as ADT, CAEP, CMAR, MMAC, MCAR, L3 and CBA, each using a different method of extracting and pruning rules [4].
Building a classifier based on AC involves extracting class association rules from the training dataset and then selecting a subset of them to build the classifier. The subset is obtained by evaluating the complete set of class association rules and keeping only the rules that cover the defined training data records. Once the classifier is built, its predictive strength is tested by predicting the class labels of the test dataset. AC thus explores the relationships between attribute values when assigning classes, aiming to obtain essential knowledge that traditional classification methods do not capture and thereby improving classification accuracy. Producing the complete set of rules, however, requires considerable CPU time and many dataset scans during the training phase. A further problem with AC is that it generates a huge number of rules, which makes the classifier difficult to understand and manage. To improve CPU usage and minimize dataset scans, classifiers should therefore be built with a minimum number of interesting rules. This can be achieved with rule pruning, which, when applied to AC, produces a high-quality, scalable and more manageable classifier. AC techniques use different pruning methods to minimize the size of the resulting classifiers; the most commonly used ones discussed in the literature are pessimistic error estimation, database coverage and lazy pruning. Many more methods exist, each with different characteristics depending on the application domain. Nowadays, with the growing number of text documents, solving the problem of text categorization is becoming a necessity. This work will include the development of different pruning and prediction methods to be implemented in association classification, followed by the application of associative classifiers to text categorization of both structured and unstructured data.
This paper is organized in five sections. Section 1 gives the introduction, Section 2 describes the importance of the study, Section 3 presents the methodology for conducting the research study, and Section 4 reviews the literature in four subsections: subsection A describes the ARM techniques and their comparison, subsection B introduces Association Classification techniques together with their rule learning approaches, subsection C describes the rule pruning techniques used in the literature, and subsection D briefly introduces text categorization using Association Classification. Section 5 concludes the paper with a discussion of future work.

Research Gap
• While building classifiers, producing the complete set of rules requires considerable CPU time and many dataset scans during the training phase.
• AC generates a huge number of rules, which makes the classifier difficult to understand and manage.
• To improve CPU usage and minimize dataset scans, classifiers need to be built with a minimum number of interesting rules.
• This can be achieved with rule pruning, which, when applied to AC, produces a high-quality, scalable and more manageable classifier.
• It remains to be examined whether applying high-confidence rules for prediction enhances classification accuracy.

Aim of the Study
The aim of this study is to produce an extensive literature review of common association rule mining approaches, with specific elaboration on rule pruning threshold criteria and class assignment tasks. These two phases are discussed in detail because of their importance in solving the main problems of the AC approach: the generation of a large number of rules, CPU and memory usage, and over-fitting in classification [5]. The study also examines how minimizing the number of rules through rule pruning affects the effectiveness and efficiency of the classifier, and analyzes the effect on classification accuracy of employing a high-confidence rule strategy for assigning class labels to the test dataset. The study will proceed by developing an AC model that employs rule pruning and class assignment methods. The model will then be applied to particular benchmark datasets and compared with other classification models.

Research Objectives
1. To minimize the number of association classification rules by implementing predictive mining approaches (rule pruning).
2. To build an association-based classifier by considering high-confidence rules.
3. To apply the proposed classifier to test data and examine the impact on classification accuracy.
4. To perform a comparative analysis of the proposed classifier with traditional classifiers.

METHODOLOGY
To achieve the research objectives the study will adopt the following methodology:

Figure 1: Association Classification Model
The proposed model will be adapted to work on a text-based training dataset. The required thresholds (minimum support and minimum confidence) will then be defined. The AC system begins by processing the training data to discover frequent item sets, followed by the generation of association classification rules. The rules are then filtered to obtain the significant rules from which the classifier is built. Finally, the classifier will be applied to the testing dataset, which is already in preprocessed format.

RELATED WORK
Although many text classifiers based on different classification approaches exist in the literature, automated text categorization remains a vital area of research that needs improvement in terms of accuracy. ARM has so far rarely been used to build associative classification models (ACM) for text categorization. This section reviews classic ARM in detail and elaborates on rule-pruning-based association classification.

A. Association Rule Mining (ARM)
Association rule mining, an important research branch of data mining, aims to find interesting and frequent patterns and to discover correlations among sets of data in data repositories. The concept of ARM was introduced by Agrawal [3]. The purpose of ARM is to discover hidden relationships among different item sets in the database. Given a transaction database, the ARM problem is to generate association rules that satisfy two predefined thresholds.

Mathematical Model of ARM
Let $I = \{I_1, I_2, I_3, \ldots, I_n\}$ be the set of $n$ distinct items. $D$ is the set of all transactions in the database, and $T$ is a transaction such that $T \subseteq I$; each transaction is a collection of items and has a unique identifier. An association rule is an implication of the form $A \rightarrow B$, where $A, B \subseteq I$ are collections of items called item sets and $A \cap B = \emptyset$, i.e. $A$ and $B$ are disjoint. $A$ is called the antecedent and $B$ the consequent, and the rule is read as "A implies B". If $|A| = k$, then $A$ is called a k-order item set and can be expressed as $A[1], A[2], \ldots, A[k]$. An association rule is evaluated on two parameters, support (s) and confidence (c). Because the database is huge, the user is concerned only with frequently occurring item sets, which are generated using predefined thresholds of support and confidence called minimum support (minsup) and minimum confidence (minconf).

Support measures the statistical significance of the association rule in the database and demonstrates the degree to which the rule is represented. The greater the support, the more important the rule [2]. In other words, it is the fraction of transactions that contain both A and B:

$$S(A \rightarrow B) = \frac{|T_A \cap T_B|}{|D|}$$

where $T_A$ is the set of transactions containing item set A, $T_B$ is the set of transactions containing item set B, $|T_A \cap T_B|$ is the number of transactions containing both A and B, and $|D|$ is the total number of transactions in database D.

Confidence measures the accuracy of the association rule, that is, how often the items in B appear in transactions that contain A:

$$C(A \rightarrow B) = \frac{|T_A \cap T_B|}{|T_A|}$$

where $|T_A \cap T_B|$ is the number of transactions containing both A and B and $|T_A|$ is the number of transactions containing item set A.

Equivalently, a rule $A \rightarrow B$ has confidence c if c% of the transactions in D that contain A also contain B, and support s if s% of the transactions in D contain $A \cup B$. The goal of ARM is to discover rules with support >= minsup and confidence >= minconf; such association rules are stronger and more effective [6].
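The two measures can be illustrated with a minimal Python sketch. The function names `support` and `confidence` and the toy transaction database `D` are assumptions made for illustration only; they do not come from any of the surveyed implementations.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Fraction of the transactions containing A that also contain B."""
    a = frozenset(antecedent)
    ab = a | frozenset(consequent)
    n_a = sum(1 for t in transactions if a <= t)
    if n_a == 0:
        return 0.0  # the rule never fires, so report zero confidence
    n_ab = sum(1 for t in transactions if ab <= t)
    return n_ab / n_a


# Hypothetical toy database D with four transactions.
D = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

s = support({"bread", "milk"}, D)        # 2 of 4 transactions
c = confidence({"bread"}, {"milk"}, D)   # 2 of the 3 bread transactions
```

Here the rule bread → milk has support 2/4 and confidence 2/3, so it would count as a strong rule only if minsup and minconf are set at or below those values.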

Process of ARM
ARM is a two step process consisting of extraction of all frequent item sets followed by extraction of strong association rules from the obtained frequent item sets.

Figure 2: Process of ARM
Association rules are interesting or frequent if their support and confidence are greater than the defined minimum support and confidence thresholds [7]. The item sets that are expected to be frequent are known as candidate item sets [1].
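The second step of the process, extracting strong rules from frequent item sets that have already been found, might be sketched as follows. The function `generate_rules` and the input dictionary `frequent` are hypothetical; the sketch assumes that the dictionary maps every frequent item set, including all of its subsets, to its support.

```python
from itertools import combinations


def generate_rules(frequent, minconf):
    """Return (antecedent, consequent, support, confidence) for strong rules.

    `frequent` maps frozenset -> support and is assumed to contain every
    subset of each frequent item set (downward closure).
    """
    rules = []
    for itemset, sup in frequent.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty antecedent and consequent
        for r in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), r):
                a = frozenset(antecedent)
                conf = sup / frequent[a]
                if conf >= minconf:
                    rules.append((a, itemset - a, sup, conf))
    return rules


# Hypothetical frequent item sets with their supports.
frequent = {
    frozenset({"bread"}): 0.75,
    frozenset({"milk"}): 0.75,
    frozenset({"bread", "milk"}): 0.5,
}
rules = generate_rules(frequent, minconf=0.6)
```

With these values both bread → milk and milk → bread reach a confidence of 0.5 / 0.75 and are kept; raising minconf to 0.7 would discard them.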

ARM Algorithm
Step 1: Set k = 1 and generate the frequent item sets of length 1.
Step 2: Repeat until no new frequent item sets are found:
• Generate candidate item sets of length k+1 from the frequent item sets of length k.
• Prune candidate item sets that contain an infrequent subset of length k.
• Calculate the support of each remaining candidate item set by passing over the database.
• Discard the item sets that do not reach the minimum support, leaving only the frequent (k+1)-item sets, and increment k.
It attempts to minimize the harmful impacts as well as maximize possible benefits in the mining process.
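The level-wise procedure in Steps 1 and 2 can be sketched in Python as below. This is an unoptimized illustration under the assumption that the whole database fits in memory; the function name `apriori` and the sample transactions are chosen for the example and are not taken from the cited works.

```python
from itertools import combinations


def apriori(transactions, minsup):
    """Return a dict mapping each frequent item set to its support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def sup(candidate):
        # One pass over the database per candidate (kept simple on purpose).
        return sum(1 for t in transactions if candidate <= t) / n

    # Step 1: frequent item sets of length 1.
    current = {}
    for item in {i for t in transactions for i in t}:
        s = sup(frozenset([item]))
        if s >= minsup:
            current[frozenset([item])] = s
    frequent = dict(current)

    # Step 2: repeat until no new frequent item sets are found.
    k = 1
    while current:
        # Join frequent k-item sets into (k+1)-candidates.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k + 1}
        # Prune candidates that have an infrequent k-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k))}
        # Count support and keep only the frequent candidates.
        current = {}
        for c in candidates:
            s = sup(c)
            if s >= minsup:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical database of five transactions.
sample = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
result = apriori(sample, minsup=0.5)
```

On this sample every single item and every pair is frequent, while {a, b, c} appears in only two of the five transactions and is discarded at the third level.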

1. Genetic Algorithm
It mines positive and negative association rules in the database using genetic operators and fitness functions, without taking minsup and minconf into account. It has been shown to be an efficient mining process that does not depend on support and confidence. The algorithm involves two stages: a) rule generation, which calculates the set of all positive and negative association rules, prunes contradicting rules and selects a high-quality subset; b) classification, which extracts a subset of the rules found in the first stage and predicts the class label of a data object by analyzing this subset. It uses a hybrid approach to deal with large datasets and provides more accurate and efficient classification and detection of frequent item sets in large databases.

2. Weighted ARM Algorithm [2]
The weighted ARM algorithm is similar to Apriori in framework but different in functionality. Its weighted support may be greater than 1, unlike the actual support, which cannot exceed 1. In addition, the Apriori algorithm needs to scan the database frequently. The weighted algorithm involves fewer database scans while producing frequent item sets, thereby improving the efficiency of data mining.

3. Improved version of Apriori Algorithm [9]
This algorithm is based on four steps: 1) prepare the data and choose the desired data; 2) produce item sets that decide the rule constraints for knowledge; 3) mine k-frequent item sets using the new database; 4) produce the association rules that set up the knowledge. It provides superior results for using the knowledge base.

4. Selective Association Rule Generation [10]
This algorithm is based on defining a set of "interesting item sets" and then selectively generating rules for only these item sets. When applied to a dataset, the number of rules found was significantly reduced.

5. Probabilistic Data Modeling for ARM [11]
This is a framework used to analyze interestingness measures; it helps develop new interestingness measures based on statistical tests and geared towards the specific properties of transaction data. The process improves the quality of the rules mined from a transactional database and enhances reliability.

1. Purpose: Evaluate students' performance by selecting some attributes and generating rules using Apriori [12].
Description: ANN is used to check the accuracy of the results. MLPNN is used to select interesting features based on 10-fold cross validation, and rules are then generated. The techniques are applied to a dataset taken from college premises.
Results: To achieve good university performance, students have to be good in their assignments, attendance and unit tests.

2. Purpose: Identification of failure patterns of students using Apriori [13].
Description: Offers helpful recommendations for academic planners using pattern analysis, aiming to reduce the failure rate and so improve academic performance. Data is taken from the students' result repository.
Results: Hidden relationships are identified between the failed courses and the causes of failure, to improve students' performance.

3. Purpose: Evaluate students' performance based on various attributes using Apriori [14].
Description: The study was conducted on data of students pursuing the Master of Computer Application (MCA) degree at Pune University. Important rules were generated to measure the correlation among various attributes in order to improve students' academic performance.
Results: Various association rules were found between attributes such as graduation percentage, attendance, assignment work and unit test performance, showing how these attributes affect the students' university results.

4. Purpose: Analyze students' performance and extract placement patterns using Apriori [15].
Description: The study was conducted on data obtained from the computer science department of an engineering college in 2011-12, predicting students' performance using internal and external assessment marks and attendance.
Results: Association rules were generated that placed students in three categories (Good, Average and Poor), on the basis of which placement was provided.

5. Purpose: Develop a smart academic advising system to guide students in selecting courses using Apriori [16].
Description: The study was conducted on student data taken from Jordan University of Science and Technology, first preprocessing the data and then generating classification rules to classify whether a student registered for a course or not.
Results: The target user can use the rules to get recommendations about the courses to register for.

B. Associative Classification
Associative classification originates from ARM but considers only the class attribute in the consequent (right-hand side) of a rule X→Y, where Y represents the class attribute. The associative classification problem consists of a training dataset T having m distinct attributes A1, A2, …, Am and a list of class labels denoted by C; |T| denotes the number of rows in T [4]. An AC algorithm consists of three steps: discovering rule items, building a classifier from the discovered rules, and finally predicting class labels and evaluating the classifier through metrics such as classification accuracy, number of rules generated and error rate. A classifier corresponds to a mapping R: A→C, where A is the set of frequent items and C is the set of class labels. In Association Classification the set of rules is constructed to predict the class of the test dataset with increased accuracy. In mathematical terms, the classifier aims to find r belonging to R that maximizes the probability of satisfying r(a) = c for each test case. Numerous association classification algorithms exist in the literature, using different rule discovery and rule pruning approaches. The most commonly used AC algorithms are [5]:
1. Classification Based on Association rules (CBA): generates rules using Apriori and then builds a classifier.
2. Classification based on Multiple Association Rules (CMAR): performs classification based on weighted Chi-Square analysis, applying multiple strong association rules.
3. Classification based on Predictive Association Rules (CPAR): implements a greedy approach for generating rules from the training dataset.
4. Multiclass Classification based on Association Rules (MCAR): a two-step process. It first generates frequent rule items, also called candidate rule items, that involve more attributes, based on minconf and minsup, and then applies the rules to the training dataset to build a classifier.

Other AC algorithms in the literature include ADT, CAEP, MMAC and L3. [17] introduced the CCSA (Cascading of Clustering based on Schwarz criteria and Association) algorithm, which performs clustering followed by association-based classification, using Apriori to generate the classification rules. The algorithm was analyzed on online datasets with Weka and resulted in improved classification accuracy with a reduced number of rules. [18] analyzed the performance of CPAR (Classification based on Predictive Association Rules), PRM (Predictive Rule Mining) and FOIL (First Order Inductive Learner) on a tuberculosis dataset, where CPAR and PRM performed better than FOIL in terms of accuracy, number of rules and time consumption. [19] used multiple relational Bayesian classification with the MCAR algorithm, relying on a Genetic Algorithm (GA) to optimize the classification rate using association rules. The results imply that MCAR with GA achieves higher classification accuracy than traditional MCAR.
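The three AC steps (rule discovery, classifier building, prediction) can be sketched as follows. This is a simplified, brute-force illustration in the spirit of CBA, not the published algorithm: the functions `mine_cars` and `predict`, the rule ranking by confidence, then support, then rule length, and the toy weather data are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def mine_cars(rows, labels, minsup, minconf, max_len=2):
    """Enumerate class association rules (condition -> class), best first."""
    n = len(rows)
    item_rows = [frozenset(r.items()) for r in rows]
    all_items = sorted({item for r in item_rows for item in r})
    rules = []
    for k in range(1, max_len + 1):
        for cond in combinations(all_items, k):
            cond = frozenset(cond)
            covered = [y for r, y in zip(item_rows, labels) if cond <= r]
            if not covered:
                continue
            # Use the majority class of the covered rows as the consequent.
            cls, hits = Counter(covered).most_common(1)[0]
            sup, conf = hits / n, hits / len(covered)
            if sup >= minsup and conf >= minconf:
                rules.append((cond, cls, sup, conf))
    # Rank: higher confidence, then higher support, then shorter condition.
    rules.sort(key=lambda r: (-r[3], -r[2], len(r[0])))
    return rules


def predict(rules, default, row):
    """Assign the class of the first (highest ranked) matching rule."""
    items = frozenset(row.items())
    for cond, cls, _, _ in rules:
        if cond <= items:
            return cls
    return default


# Hypothetical training data: attribute-value rows and class labels.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay", "play", "stay"]

rules = mine_cars(rows, labels, minsup=0.15, minconf=0.8)
default = Counter(labels).most_common(1)[0][0]  # majority class as fallback
guess = predict(rules, default, {"outlook": "rainy", "windy": "yes"})
```

Predicting with the single highest-ranked matching rule corresponds to the high-confidence class assignment strategy discussed in this paper; algorithms such as CMAR instead combine several matching rules.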

Rule Discovery approaches in Association Classification
The first step in AC is to discover rule items in two sub-steps: first discovering the frequent items and then generating rules to form the set of Classification Association Rules (CARs) used to build the classifier. The following approaches are used for rule generation:

Apriori Search
In the Apriori algorithm, frequent item sets are discovered in a few iterations, each involving a database scan. Whether an item is frequent is decided by the support of the candidate rule item. The CBA algorithm uses Apriori to generate frequent rule items. This approach requires repeated database scans, consuming more computation time at each level. Many techniques discussed in the literature, such as AprioriTid, extend the Apriori algorithm to generate classification rules.

Vertical Mining Approach
This approach uses simple intersections among the transaction ID lists of items to find the rules, thereby reducing the number of database scans. Algorithms such as Eclat and MCAR use this approach to reduce the computational time needed to discover CARs.
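A minimal sketch of the vertical idea is given below: each item is stored with the set of transaction IDs in which it occurs, and the support count of a larger item set is obtained by intersecting those sets instead of rescanning the database. The functions `vertical_layout` and `eclat` are an Eclat-style illustration written for this example, assuming the tid-sets fit in memory.

```python
def vertical_layout(transactions):
    """Map each item to the set of transaction IDs that contain it."""
    tidsets = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tidsets.setdefault(item, set()).add(tid)
    return tidsets


def eclat(prefix, items, min_count, out):
    """Depth-first search over item sets using tid-set intersections.

    `items` is a list of (item, tidset) pairs that are already frequent
    in the context of `prefix`; results are written to `out`.
    """
    for i, (item, tids) in enumerate(items):
        itemset = prefix | {item}
        out[frozenset(itemset)] = len(tids)
        suffix = []
        for other, other_tids in items[i + 1:]:
            common = tids & other_tids  # support count without a rescan
            if len(common) >= min_count:
                suffix.append((other, common))
        eclat(itemset, suffix, min_count, out)


# Hypothetical database of five transactions.
sample = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
tidsets = vertical_layout(sample)
start = sorted(((i, t) for i, t in tidsets.items() if len(t) >= 3),
               key=lambda pair: pair[0])
counts = {}
eclat(frozenset(), start, 3, counts)
```

The database is read once to build the tid-sets; every later support count is a set intersection, which is where the reduction in database scans comes from.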

FOIL Decision Tree Approach
This strategy produces rules for each class in the training dataset. The training data is divided into two subsets, one containing the positive cases and the other the negative cases associated with a class C. The CPAR algorithm adopts this greedy approach for generating rules, seeking the best rule condition at each step.
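The text does not state how the best rule condition is scored. In FOIL the usual criterion is the FOIL gain, which compares the positive and negative cases covered before and after a condition is added to the rule; the sketch below assumes that measure. The function names `foil_gain` and `best_condition` and the candidate counts are hypothetical.

```python
from math import log2


def foil_gain(p0, n0, p1, n1):
    """FOIL gain of adding a condition to a rule.

    p0, n0: positive and negative cases covered before adding it (p0 > 0).
    p1, n1: positive and negative cases covered after adding it.
    """
    if p1 == 0:
        return 0.0  # a condition that covers no positives gains nothing
    return p1 * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))


def best_condition(p0, n0, candidates):
    """Pick the candidate condition with the highest gain.

    `candidates` maps a condition name to its (p1, n1) coverage counts.
    """
    return max(candidates, key=lambda c: foil_gain(p0, n0, *candidates[c]))


# Hypothetical coverage counts for two candidate conditions.
candidates = {"outlook=sunny": (8, 2), "windy=no": (5, 5)}
chosen = best_condition(10, 10, candidates)
```

A condition that keeps most positive cases while excluding negative ones receives a high gain, which matches the greedy search for the best rule condition described above.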

Frequent Pattern (FP) Growth Approach
The FP-growth algorithm is more efficient than Apriori in terms of CPU time and memory usage. It constructs an FP-tree representing the training dataset and generates rule items in two steps: it first builds the FP-tree and then extracts the frequent rule items directly from it. CMAR is the first AC algorithm to adopt the FP-growth learning approach. The memory requirements of the FP-tree become large for large datasets, which makes the approach less preferable in that setting.
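The first of the two steps, building the FP-tree, can be sketched as below. Only tree construction and the lookup of prefix paths are shown; the recursive mining of conditional trees is omitted. The class `Node` and the functions `build_fp_tree`, `item_support` and `prefix_paths` are names chosen for this illustration.

```python
from collections import Counter


class Node:
    """One FP-tree node: an item, its count, and links to parent/children."""

    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}


def build_fp_tree(transactions, min_count):
    """Build the tree; return its root and a header table item -> nodes."""
    counts = Counter(i for t in transactions for i in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    root = Node(None, None)
    header = {i: [] for i in frequent}
    for t in transactions:
        # Keep frequent items, most frequent first (ties broken by name),
        # so that transactions share prefixes as much as possible.
        ordered = sorted((i for i in set(t) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header


def item_support(header, item):
    """Support count of an item, summed over all of its tree nodes."""
    return sum(n.count for n in header[item])


def prefix_paths(header, item):
    """Conditional pattern base of `item`: (path from root, count) pairs."""
    paths = []
    for n in header[item]:
        path, p = [], n.parent
        while p is not None and p.item is not None:
            path.append(p.item)
            p = p.parent
        paths.append((path[::-1], n.count))
    return paths


# Hypothetical database of five transactions.
sample = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
root, header = build_fp_tree(sample, min_count=3)
paths_c = prefix_paths(header, "c")
```

The database is scanned twice, once to count items and once to insert the transactions; frequent rule items are then read from the prefix paths rather than from the original data.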

Association Rule Mining based on Weighted Class
In this approach, weighted association rules are produced using weighted support and weighted confidence. Rule items that pass the weighted support and weighted confidence thresholds are appended to the set of frequent weighted rule items.
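The text does not define weighted support. One definition found in the weighted ARM literature multiplies the ordinary support of an item set by the sum of the weights of its items; the sketch below assumes that definition, which is consistent with the earlier remark that weighted support may exceed 1. The function `weighted_support`, the item weights and the threshold `wminsup` are hypothetical.

```python
def weighted_support(itemset, transactions, weights):
    """Assumed definition: (sum of item weights) * ordinary support."""
    itemset = frozenset(itemset)
    sup = sum(1 for t in transactions if itemset <= frozenset(t))
    sup /= len(transactions)
    return sum(weights[i] for i in itemset) * sup


# Hypothetical database and item weights in the range (0, 1].
sample = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
weights = {"a": 0.9, "b": 0.8, "c": 0.2}

ws_ab = weighted_support({"a", "b"}, sample, weights)  # (0.9 + 0.8) * 3/5
ws_c = weighted_support({"c"}, sample, weights)        # 0.2 * 4/5

# Keep only the item sets that pass a weighted minimum support threshold.
wminsup = 0.5
kept = [s for s in ({"a", "b"}, {"c"})
        if weighted_support(s, sample, weights) >= wminsup]
```

Under this assumption the pair {a, b} receives a weighted support above 1 and is kept, whereas the more frequent but lightly weighted item c is discarded.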

C. Rule Pruning methods in Association Classification
AC techniques generally suffer from the problem of deriving a large set of rules, because datasets are highly correlated and all attributes are considered when generating rules. This may lead to insignificant and redundant rules, which need to be eliminated for effective and accurate classification. Different pruning methods have been introduced in the literature, based on concepts such as decision trees, pessimistic error estimation and Chi-Square test statistics, and they can be applied either during rule generation or while building the classifier. Rule pruning methods add constraints on rule discovery to reduce the size of the classifier. The pruning techniques implemented by AC algorithms include Chi-Square testing, redundant rule pruning, database coverage, pessimistic error estimation, lazy pruning, conflicting rules and Laplace accuracy [4]. Analysis has shown that algorithms using lazy pruning produce a large number of rules, making them unmanageable for accurate classification. In comparison, strategies based on database coverage and pessimistic error estimation build moderately sized classifiers. Other effective pruning methods have also been introduced. [20] proposed a way of selecting association rules based on interestingness measures such as support, confidence, correlation and so on. The study also presents the distribution of the rule clusters with pattern Xi→Y over different interestingness measures. [21] surveyed different rule pruning methods for removing redundant rules during rule generation. [22] proposed a method for pruning statistically insignificant association rules in the presence of high-confidence rules for web usage. This pruning method is based on testing the confidence of rules with low local z-scores. [23] proposed Critical Relative Support (CRS) based pruning to improve classification accuracy and reduce the size of the classifier. Using the CRS threshold reduced testing time by reducing the number of combinations and comparisons of test attributes.
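Database coverage, one of the pruning methods listed above, can be sketched as follows. The sketch assumes that the rules are already ranked best first and follows the commonly described scheme in which a rule is kept only if it correctly classifies at least one training case that is still uncovered, after which all cases matching the rule are removed. The function `database_coverage` and the toy data are illustrative assumptions.

```python
def database_coverage(ranked_rules, rows, labels):
    """Keep the rules that correctly cover at least one remaining case.

    ranked_rules: list of (condition frozenset, class), best rule first.
    rows: list of frozensets of (attribute, value) items.
    Returns the kept rules and the indices of cases left uncovered.
    """
    remaining = list(range(len(rows)))
    kept = []
    for cond, cls in ranked_rules:
        matched = [i for i in remaining if cond <= rows[i]]
        if any(labels[i] == cls for i in matched):
            kept.append((cond, cls))
            covered = set(matched)
            remaining = [i for i in remaining if i not in covered]
        if not remaining:
            break  # every training case is covered; later rules are pruned
    return kept, remaining


# Hypothetical training cases and ranked rules.
rows = [
    frozenset({("outlook", "sunny"), ("windy", "no")}),
    frozenset({("outlook", "sunny"), ("windy", "yes")}),
    frozenset({("outlook", "rainy"), ("windy", "yes")}),
    frozenset({("outlook", "rainy"), ("windy", "no")}),
    frozenset({("outlook", "rainy"), ("windy", "yes")}),
]
labels = ["play", "play", "stay", "play", "stay"]
ranked = [
    (frozenset({("outlook", "sunny")}), "play"),
    (frozenset({("windy", "no")}), "play"),
    (frozenset({("outlook", "rainy"), ("windy", "yes")}), "stay"),
    (frozenset({("outlook", "sunny"), ("windy", "no")}), "play"),
    (frozenset({("outlook", "rainy"), ("windy", "no")}), "play"),
]
kept, uncovered = database_coverage(ranked, rows, labels)
```

In this example the first three rules already cover all five training cases, so the last two, more specific rules are pruned and the classifier shrinks from five rules to three.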

D. Text Categorization
Text categorization refers to assigning class labels to new documents based on what was learned from the training data during the classification process. An efficient text classifier categorizes a huge set of text documents in an optimized time frame, with an acceptable accuracy level, while producing significant and human-readable results. [24] presented an approach for automatic text categorization using association rule mining that focuses on two issues: finding the best term association rules and building a text classifier. The approach proposed the Association Rule based Classifier By Category (ARC-BC) and dominance-factor-based pruning to find significant rules for multi-class categorization. [25] proposed two algorithms, ARC-BC (Association Rule based Classifier By Category) and ARC-AC (Association Rule based Classifier All Category); in ARC-BC the categories are considered one at a time, while in ARC-AC the training set is considered as a whole. The experimental results showed that ARC-AC, being a global classifier, performed well in terms of effectiveness but needs to be improved by the addition of partial and relative support thresholds. [26] proposed text categorization using association rules and a Naïve Bayes classifier, using word relations to derive a feature set from pre-classified text documents and then applying the Naïve Bayes classifier to the derived features for the final categorization.
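To apply ARM to text, each document is first turned into a transaction of terms. The sketch below illustrates this step and a by-category treatment in which every category is mined separately, loosely following the ARC-BC idea of considering categories one at a time. It is restricted to single-term rules and is not the published algorithm; the stop-word list, the function names `to_transaction` and `rules_by_category`, and the sample documents are assumptions.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}


def to_transaction(text):
    """Turn a document into a set of lower-cased terms without stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return frozenset(w for w in words if w not in STOP_WORDS)


def rules_by_category(docs, minsup):
    """Mine single-term rules (term -> category) one category at a time.

    docs: list of (text, category) pairs.
    Returns {category: {term: support within that category}}.
    """
    by_category = {}
    for text, category in docs:
        by_category.setdefault(category, []).append(to_transaction(text))
    rules = {}
    for category, transactions in by_category.items():
        counts = Counter(term for t in transactions for term in t)
        n = len(transactions)
        rules[category] = {term: c / n for term, c in counts.items()
                           if c / n >= minsup}
    return rules


# Hypothetical pre-classified training documents.
docs = [
    ("The match ended in a draw", "sport"),
    ("A late goal won the match", "sport"),
    ("Shares of the bank fell", "finance"),
    ("The bank raised interest rates", "finance"),
]
term_rules = rules_by_category(docs, minsup=1.0)
```

Because support is computed within each category, a term that is frequent in a small category is not drowned out by larger categories; multi-term rule items would be obtained by running a frequent item set miner on each category's transactions.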

CONCLUSION
AC is a preferred classification approach that originated from the integration of association rule mining and classification. AC is capable of building more accurate classifiers than traditional classifiers such as decision trees, KNN and SVM. This paper presented different ARM techniques and their improved versions, followed by applications of ARM. The main focus of the paper is to understand Association Classification and its techniques, which adopt different rule learning approaches, and then to discuss rule pruning methods and their importance, with the aim of increasing the classification accuracy of the classifier. The study will continue experimentally, comparing the performance of different AC algorithms and rule pruning techniques in terms of classification accuracy and number of rules produced. It will attempt to use a high-confidence-based pruning method in association rule mining to build fast and efficient classifiers, and to adapt the same method to automatic text document categorization.