AN IMPROVED AND EFFICIENT METHOD TO DISCOVER THE FREQUENT PATTERNS FROM TARGETED PATTERNS IN TRANSACTIONAL DATASET USING TPIITR-FPMM

In recent years, data mining has become an essential technique for discovering useful knowledge from transactional datasets. Association analysis is one of the vital data mining techniques. It captures relationships among items in a transactional dataset, and these relationships are generally used to develop future business strategy. The main step of association analysis is finding the frequent patterns in a large transactional dataset. Many methods for finding frequent patterns are available in the literature, and most of them find all frequent itemsets for a specified minimum support threshold. In some instances, however, it is desirable to examine only the presence of a few targeted patterns (for example, a special offer on a group of items to promote retail sales) in a large transactional dataset. For this purpose, we previously introduced SIFPMM (Selective Itemsets Frequent Pattern Mining Method) and TM-PIFPMM (Transaction Merging-Predefined Itemsets Frequent Pattern Mining Method). To improve on the performance of TM-PIFPMM, TPIITR-FPMM (Targeted Patterns Involved Items Transaction Reduction-Frequent Pattern Mining Method) is proposed, and its performance is compared with that of Apriori, FP-Growth, SIFPMM and TM-PIFPMM. The experimental analysis shows that TPIITR-FPMM outperforms Apriori, FP-Growth, SIFPMM and TM-PIFPMM.


I. INTRODUCTION
With the spread of information devices and the internet, large amounts of data are produced routinely in the course of day-to-day management in business, education, banking, health services, environmental protection, social services, the retail industry and security. These data are mainly used for accounting and for managing customer relations. The datasets are normally very large, grow constantly, and contain many useful but hidden complex features. Extracting features or knowledge from such datasets requires simple, robust and computationally efficient tools. Data mining draws on techniques from computer science, mathematics and statistics to develop such tools [1].
As a consequence, data mining is promoted as a solution to decision support problems in various business organizations and social service sectors, and it is regarded as an essential area of research today [2]. It has attracted data mining professionals because of its applicability in areas such as decision support, banking, the retail industry, fraud detection, finance, health services, advertising, pharmaceutics, government and all sorts of e-business [3]. Organizations around the world have begun to recognize that the information collected over the years is an essential strategic asset and that valuable intelligence is hidden in this massive amount of data. Data mining contributes techniques to discover hidden knowledge from such data [4], [5]. It can be defined as a collection of techniques for automatically discovering previously unknown, valid, novel, valuable and understandable patterns in large datasets [6].
Data mining tasks can broadly be categorized into two types: predictive tasks and descriptive tasks. Predictive mining tasks perform inference on the input data to obtain hidden knowledge; they include classification, regression and deviation detection. Descriptive mining tasks mine the general properties of the data in the database; they include clustering, association mining and sequential mining [7].
Research issues in data mining are normally concerned with performance, mining techniques, user requirements, memory requirements and data diversity. Data mining methods must therefore be efficient and scalable with respect to dataset size and execution time [8], [9], [10]. Association rule mining is one of the most popular descriptive data mining techniques. Since its introduction [11], it has become one of the fundamental data mining tasks and has received remarkable attention among data mining researchers [12]. It is generally used to find correlations between variables in a large dataset. For example, in market basket analysis, it can be useful to find out how many customers buy a pencil and an eraser together in order to improve future business strategy. Domain experts can use these results to discover customer purchasing habits and maximize the profit of the organization. Frequent pattern mining according to user requirements is therefore the main problem of association rule mining. The association rule for the above example can be stated as $buys(a, \text{pencil}) \Rightarrow buys(a, \text{eraser})$, where 'a' is a variable and buys(a, b) is a predicate stating that a buyer 'a' buys an item b. This rule states that most people who purchase a pencil also buy an eraser [13].
Association rule mining is defined as follows. Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of products. A nonempty subset of I is termed an itemset and is denoted $X = \{i_1, i_2, \ldots, i_n\}$. Let $D = \{t_1, t_2, \ldots, t_k\}$ be a collection of transactions. Each transaction T contains one or more products such that $T \subseteq I$. The number of products in an itemset defines its size, and an itemset of size L is denoted an L-itemset [14].
Let A and B be sets of items. An association rule is written as $A \Rightarrow B$, where A is the antecedent and B is the consequent of the rule. Association rule mining is usually controlled by two statistical measures, support and confidence [6]. It first finds the frequent patterns using a minimum support threshold and then applies a confidence threshold to decide the correlation between the frequent patterns. With the support of an itemset X defined as the fraction of transactions in D that contain X, support and confidence are computed as follows [15]:
$$support(X) = \frac{|\{T \in D : X \subseteq T\}|}{|D|}$$
$$support(A \Rightarrow B) = support(A \cup B)$$
$$confidence(A \Rightarrow B) = \frac{support(A \cup B)}{support(A)}$$
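The two measures can be illustrated with a minimal Python sketch. The function names `support` and `confidence` and the four-transaction toy dataset are assumptions made for illustration only; they do not come from the experiments reported in this paper.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(itemset: Iterable[str], transactions: List[Transaction]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    target = frozenset(itemset)
    hits = sum(1 for t in transactions if target <= t)
    return hits / len(transactions)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               transactions: List[Transaction]) -> float:
    """support(A ∪ B) / support(A); assumes the antecedent occurs at least once."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    return support(both, transactions) / support(a, transactions)


# Hypothetical toy dataset (not taken from the paper).
transactions = [
    frozenset({"pencil", "eraser"}),
    frozenset({"pencil", "eraser", "ruler"}),
    frozenset({"pencil"}),
    frozenset({"ruler"}),
]

rule_support = support({"pencil", "eraser"}, transactions)
rule_confidence = confidence({"pencil"}, {"eraser"}, transactions)
print(rule_support, rule_confidence)
```

In this toy dataset the rule pencil ⇒ eraser holds in two of four transactions, and two of the three pencil buyers also buy an eraser.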
The rest of the paper is organized as follows. Related works are reviewed in Section II. The proposed algorithm is explained in Section III. Experimental results, evaluation and discussion are given in Section IV. The application of the proposed method is discussed in Section V. Conclusions and future directions are given in Section VI.

II. RELATED WORKS
Frequent pattern mining in transactional datasets is the main task of association analysis. It was first found helpful in market basket analysis for promoting future sales. Many methods have been introduced in the literature to find frequent patterns according to domain expert requirements. These methods can generally be categorized into two types: candidate generation [11] and pattern growth [19].
AIS (Agrawal, Imielinski and Swami), presented by Agrawal et al. [11] in 1993, was the first algorithm to find frequent patterns, and it does so using a candidate generation technique. It was followed by Apriori, presented by Agrawal et al. [4] in 1994. Many methods have since been presented to improve the efficiency of Apriori. Nevertheless, Apriori needs many database scans to discover the frequent patterns, and its execution time grows as the number of distinct items and the number of transactions in the dataset increase.
The FP-Growth algorithm for finding frequent patterns was introduced by Han et al. in 2000 and uses an FP-tree structure for pattern growth. It needs at most two database scans to construct the FP-tree and then finds the frequent patterns from the FP-tree. If the dataset contains a large number of distinct items and transactions, constructing the FP-tree becomes difficult because the complete FP-tree must be kept in main memory until all the required frequent patterns have been found. The construction and maintenance of the FP-tree is therefore difficult and time consuming [19].
We presented SIFPMM [24] to discover frequent patterns from the selective patterns given by domain experts in a transactional dataset, in order to improve future business policy. It was shown to work better than Apriori and FP-Growth.
Later we proposed TM-PIFPMM [25], which improves the performance of SIFPMM by applying a transaction merging technique to the dataset, and we demonstrated that TM-PIFPMM performs better than SIFPMM.
Even though TM-PIFPMM works better than SIFPMM for finding frequent patterns from the targeted patterns given by domain experts, an ever-growing database calls for a more capable method with modified data structures that finds the required frequent patterns in less mining time than TM-PIFPMM. This paper therefore introduces the TPIITR-FPMM method, which discovers frequent patterns from the targeted patterns given by domain experts while reducing the computing time of TM-PIFPMM.

A. Selection of Targeted Patterns
Let $F = \{F_1, F_2, \ldots, F_i\}$ be the collection of frequent patterns obtained from an old dataset for future strategic decisions, and let $X = \{X_1, X_2, \ldots, X_j\}$ be the collection of targeted patterns that the domain expert selects from F, based on past experience, to improve the profit of the organization. Expressed in tuple relational calculus, X contains the patterns in F that satisfy the domain expert's conditions for advancing the future business strategy. Alternatively, the domain expert can select the targeted patterns X directly from offers given on combinations of products drawn from the total set of products or items (I) available in the retail organization, in order to promote future sales; this selection can likewise be expressed in tuple relational calculus. The targeted patterns are normally gathered in advance by the domain expert and stored in a text file before the proposed method is executed.

B. Finding Involved Items
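Let $X = \{X_1, X_2, \ldots, X_j\}$ be the collection of targeted patterns decided by the domain expert, as stated in the previous section. The involved items are the items that occur in at least one targeted pattern, that is, the union $X_1 \cup X_2 \cup \cdots \cup X_j$.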
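A minimal Python sketch of this step follows, assuming that the involved items are simply the union of the items of all targeted patterns. The helper name `involved_items` is hypothetical; the example patterns are the offer combinations used later in Section V.

```python
from typing import Iterable, Set


def involved_items(targeted_patterns: Iterable[Iterable[int]]) -> Set[int]:
    """Collect every item that appears in at least one targeted pattern."""
    items: Set[int] = set()
    for pattern in targeted_patterns:
        items.update(pattern)
    return items


# Targeted patterns from the retail example in Section V.
targeted = [{2, 3, 4}, {5, 6, 8}, {2, 3, 5, 6}]
involved = involved_items(targeted)
print(sorted(involved))
```

For these three patterns, six of the ten shop items are involved, so only those six items need to be kept in the reduced dataset.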

C. Dataset Reduction
The transactions in the dataset are read one by one, and all items other than the involved items found in the previous section are removed from each transaction. At the same time, identical transactions are combined into a single transaction together with a count of the transactions combined [26], and the result is stored in a table termed the transactions combining table (TCT). This transformation reduces the number of transactions in the dataset to at most $2^P - 1$, where P denotes the number of distinct involved items. The data reduction phase significantly decreases the execution time needed to mine the required frequent patterns.
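The reduction step can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the TCT is modelled as a `collections.Counter` keyed by the reduced transaction, the function name `reduce_dataset` is hypothetical, and transactions left empty after item removal are dropped, which is consistent with the $2^P - 1$ bound (the number of nonempty subsets of P items). The five sample transactions are invented.

```python
from collections import Counter
from typing import FrozenSet, Iterable


def reduce_dataset(transactions: Iterable[Iterable[int]],
                   involved: Iterable[int]) -> "Counter[FrozenSet[int]]":
    """Single pass: keep only involved items, then merge identical transactions."""
    keep = frozenset(involved)
    tct: "Counter[FrozenSet[int]]" = Counter()
    for t in transactions:
        projected = frozenset(t) & keep   # remove non-involved items
        if projected:                     # assumption: skip empty results
            tct[projected] += 1           # merge duplicates by counting
    return tct


# Hypothetical transactions over the items 1..10.
dataset = [{1, 2, 3, 4}, {2, 3, 4, 9}, {5, 6, 8, 10}, {1, 7}, {2, 3, 5, 6, 7}]
involved = {2, 3, 4, 5, 6, 8}

tct = reduce_dataset(dataset, involved)
for row, count in tct.items():
    print(sorted(row), count)
```

Here the first two transactions collapse into one TCT row with count 2, and the transaction {1, 7} disappears because it contains no involved item. With P = 6 involved items the TCT can never hold more than 63 rows, whatever the size of the original dataset.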

D. Presence Calculation Table
The proposed method uses a table termed the Presence Calculation Table (PCT). It has two fields, namely targeted patterns and presence calculation value. The targeted patterns field contains all targeted patterns, and the presence calculation value field holds the corresponding presence count of each pattern in the transactional dataset. The presence calculation value of a pattern is the total number of times that pattern is present in the transactional database D. A specimen calculation of the presence of targeted patterns is given in table I.
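A minimal Python sketch of how the PCT could be filled from a reduced dataset is given below. It assumes the merged transactions are available as a mapping from reduced transaction to merge count, and that a targeted pattern is reported as frequent when its presence count divided by the original number of transactions reaches the minimum support. The function names, the sample counts and the total of 100 transactions are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable


def presence_counts(tct: Dict[FrozenSet[int], int],
                    targeted_patterns: Iterable[Iterable[int]]
                    ) -> Dict[FrozenSet[int], int]:
    """For each targeted pattern, add up the counts of the rows that contain it."""
    pct: Dict[FrozenSet[int], int] = {}
    for pattern in targeted_patterns:
        p = frozenset(pattern)
        pct[p] = sum(count for row, count in tct.items() if p <= row)
    return pct


def frequent_patterns(pct: Dict[FrozenSet[int], int], total_transactions: int,
                      min_support: float) -> Dict[FrozenSet[int], int]:
    """Keep the targeted patterns whose relative presence meets the threshold."""
    return {p: c for p, c in pct.items()
            if c / total_transactions >= min_support}


# Hypothetical reduced dataset: reduced transaction -> number of merged originals.
tct = {
    frozenset({2, 3, 4}): 40,
    frozenset({5, 6, 8}): 5,
    frozenset({2, 3, 4, 5, 6}): 12,
    frozenset({2, 3}): 43,
}
targeted = [{2, 3, 4}, {5, 6, 8}, {2, 3, 5, 6}]

pct = presence_counts(tct, targeted)
frequent = frequent_patterns(pct, total_transactions=100, min_support=0.10)
for pattern, count in pct.items():
    print(sorted(pattern), count, pattern in frequent)
```

Because only the targeted patterns are counted, the number of frequent patterns reported can never exceed the number of targeted patterns.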
Before mining, the transactions in the datasets are preprocessed and identical transactions are combined in order to reduce the mining time. Apriori, FP-Growth and SIFPMM do not use any data reduction technique. TM-PIFPMM and TPIITR-FPMM use data reduction techniques that reduce the size of the original dataset without changing its meaning for the purpose of finding the required frequent patterns. Table VI shows that the proposed TPIITR-FPMM reduces the dataset far more than TM-PIFPMM because it uses both vertical and horizontal data reduction, which in turn reduces the mining time. The proposed method was first tested by running Apriori, FP-Growth, SIFPMM, TM-PIFPMM and TPIITR-FPMM on the four groups of datasets mentioned above with a 10% minimum support threshold. Table VII shows the corresponding run times for those datasets.
From table VII and figure 1, it is easily observed that the run time of the mining process decreases progressively from Apriori to FP-Growth, from FP-Growth to SIFPMM, from SIFPMM to TM-PIFPMM and from TM-PIFPMM to TPIITR-FPMM. Table VIII shows that our proposed algorithms SIFPMM, TM-PIFPMM and TPIITR-FPMM use one scan of the dataset, whereas Apriori uses a larger number of database scans that depends on the content of the dataset and the support threshold, and FP-Growth uses 2 database scans. Fewer scans reduce the computing time of the mining process because far less data is read. Table IX shows that, although SIFPMM, TM-PIFPMM and TPIITR-FPMM differ in the memory they use for mining, all of them use less memory than Apriori and FP-Growth. Table X shows that SIFPMM, TM-PIFPMM and TPIITR-FPMM generate a number of frequent patterns less than or equal to the number of targeted patterns, so the results are easy to verify and act on to improve future business strategy.
From table XI, it can be concluded that the proposed method takes less mining time because it compresses the dataset from 88162 transactions to 63 without changing its meaning. From figure 2 and table XII, it is readily seen that the proposed method takes less mining time than the existing methods Apriori, FP-Growth, SIFPMM and TM-PIFPMM. It is also seen that the computing time of Apriori gradually decreases as the support threshold increases. Table XIII displays the number of database scans used to accomplish the mining tasks. Our proposed methods use only one scan for the mining process, and reducing the number of database scans generally decreases the mining time.
Table XIV shows that, although the maximum memory needed to mine the required frequent patterns fluctuates among the proposed methods, they take less space than the existing Apriori and FP-Growth. Table XV demonstrates that our methods for finding frequent patterns from targeted patterns generate fewer frequent patterns than the other existing methods at the different support thresholds. This makes it easier to take decisions that improve the business in future.

C. Performance Analysis of Proposed Method Based on Mining Time
Figure 1 and figure 2 make it evident that the proposed method takes less mining time to find frequent patterns from targeted patterns. The actual time reduction rate of the proposed method against Apriori, FP-Growth, SIFPMM and TM-PIFPMM is given in table XVI for the synthetically generated datasets with a 10% minimum support threshold and in table XVII for the real retail dataset with various support thresholds.

V. APPLICATION OF PROPOSED METHOD
The usefulness of the proposed method is demonstrated in this section. Retail shops usually provide various types of offers, such as seasonal offers, festival offers and stock clearance offers, to promote sales and improve the profit of the organization.
Consider a retail shop that sells 10 items {1,2,3,4,5,6,7,8,9,10}, the targeted patterns {{2,3,4},{5,6,8},{2,3,5,6}} given as offer combinations, and two datasets D1 and D2. D1 contains 1000 customer purchase transactions recorded before the offers were given, and D2 contains 1000 customer purchase transactions recorded after the offers were provided. The actual occurrence counts (observed frequencies) of the targeted patterns in both datasets are found using the proposed TPIITR-FPMM method and are given in tabular form. The $\chi^2$ test at the 5% level is applied to determine whether the offers promote sales and thereby improve the profit of the organization. The $\chi^2$ statistic is computed as
$$\chi^2 = \sum_{i} \sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
where $O_{ij}$ is the observed frequency and $E_{ij}$ is the expected frequency of the targeted patterns.
The following hypotheses are set up to check whether the offers (targeted patterns) improve the sales of the organization. The null hypothesis $H_0$ is that offer and sales are independent, and the alternative hypothesis $H_1$ is that offer and sales are dependent, so that the offers promote sales and profit.
The contingency tables of observed and expected frequencies were created for each offer (targeted pattern) mentioned above using table XVIII and table XIX. The $\chi^2$ values for the 3 targeted patterns were then calculated and are given in table XX [27].
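The test for one targeted pattern can be sketched in Python as follows. The layout of the contingency table is an assumption, since the source tables are not reproduced here: rows are taken to be the datasets D1 (before the offer) and D2 (after the offer), and columns are the numbers of transactions that contain or do not contain the pattern. The observed counts are invented for illustration. For a 2 x 2 table there is 1 degree of freedom, for which the standard critical value at the 5% level is 3.841.

```python
from typing import List


def chi_square(observed: List[List[float]]) -> float:
    """Pearson chi-square statistic: sum over cells of (O - E)^2 / E."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected frequency under independence of offer and sales.
            e = row_totals[i] * col_totals[j] / grand_total
            statistic += (o - e) ** 2 / e
    return statistic


# Hypothetical counts: [pattern present, pattern absent] in 1000 transactions each.
observed = [
    [100, 900],   # D1, before the offer
    [150, 850],   # D2, after the offer
]

CRITICAL_5_PERCENT_1_DOF = 3.841
statistic = chi_square(observed)
reject_h0 = statistic > CRITICAL_5_PERCENT_1_DOF
print(round(statistic, 3), reject_h0)
```

With these invented counts the statistic is about 11.43, which exceeds 3.841, so $H_0$ would be rejected and the offer and sales would be treated as dependent.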

VI. CONCLUSIONS AND FUTURE DIRECTIONS
Finding frequent patterns in a large dataset is the fundamental task of association rule mining. Association rule mining finds correlations among items in a large dataset to inform the future business strategy and improve the profit of an organization. TPIITR-FPMM discovers frequent patterns from the targeted patterns suggested by domain experts so that precise decisions can be made to maximize the profit of the organization. The empirical analysis shows that the proposed method takes less time to mine frequent patterns from targeted patterns than the existing methods Apriori, FP-Growth, SIFPMM and TM-PIFPMM across the datasets and support thresholds tested. Using the proposed method together with the $\chi^2$ statistical test, it is shown that sales improve when offers are provided on combinations of items. We plan to apply this technique to online datasets in future. In addition, further real datasets suitable for the proposed method should be identified to improve its proficiency.