Higher-Order Analysis of ASD Genetic Data Using PrefixSpan and PCA Methods

The primary aim of data mining is to extract useful information from datasets. Data mining can extract meaningful patterns from large datasets and can analyze a dataset to make predictions and classifications according to user specifications. This paper applies data mining techniques to medical data from the Gene Expression Omnibus in the NCBI database. The microarray data for Autism Spectrum Disorder (ASD), containing 100 genes from 21 ASD children, are analyzed with an unsupervised sequential pattern mining algorithm, PrefixSpan, to find sequence patterns, and with a dimensionality reduction algorithm, Principal Component Analysis (PCA), to find the positively and negatively correlated genes for ASD. Comparing the results of the two algorithms identifies which of the 100 genes are highly influenced by Autism Spectrum Disorder.


I. INTRODUCTION
Microarray data contain a very large number of genes measured across a number of samples, and disease prediction and gene analysis are carried out from these data. Pattern mining discovers the most useful and interesting patterns in a database. Principal Component Analysis (PCA) reduces the dimensionality of data: a database used in data mining typically holds many variables, and PCA identifies those that are highly correlated. A visual representation of the PCA result shows the patterns in the dataset. In this work, PCA is used to compare genes in order to analyze gene expression. PCA is used alongside PrefixSpan to find the positively, negatively, and poorly correlated variables.

II. LITERATURE REVIEW
Yin Li, Yan Cong, and Yun Zhao (2016) [1] describe network motifs for coronary artery disease. An integrated differential network of genes and protein-protein interactions is analyzed, and interaction patterns are identified by screening this differential network. The screening yields the top 20 networks, which are used to identify coronary artery disease. The R package GlobalAncova was used for the screening. The main advantage of the screened network motifs is that they give accurate results in identifying coronary artery disease.
Yin Wang, Rudong Li, Yuhua Zhou, Zongxin Ling, Xiaokui Guo, Lu Xie, and Lei Liu (2016) [2] classify diseases based on the microbial metagenome. The classification is done with a phylogenetic tree-based motif finding algorithm (PMF). The PMF algorithm has three parts: motif finding, motif sorting, and model evaluation. PMF classifies two diseases, pneumonia and dental caries, based on the microbial metagenome. The main advantage of PMF is that it finds motifs in the training data, from which the disease is classified.
S. Padmavathi and E. Ramanujam (2015) [5] use multivariate maximal time series motifs to identify frequently occurring patterns and then use a Naive Bayes classifier to separate normal from abnormal signals, with an accuracy of 93.33% and a precision of 98%. The method is applied to electrocardiogram (ECG) data to classify abnormalities in ECG signals.
Duc-Hau Le and Vu-Tung Dang (Springer Berlin Heidelberg, 2016) [10] use network motifs for disease prediction. The Random Walk with Restart on Heterogeneous network (RWRH) algorithm is applied to network motifs to identify network similarity for Alzheimer's disease and, based on the network, gives better functionality among the diseases considered. An ontology is used to predict the network similarity.
Shameek Ghosh, Hung Nguyen, and Jinyan Li (2016) [6] detect critical patient events such as hypotension and septic shock using order sequential contrast pattern-based classification on time series. SVM and HMM are used for classification on arterial pressure series. The approach gives better prediction of ICU outcomes, which is the application of this system.
Kai Shi, Lin Gao, and Bingbo Wang (2016) [7] use network motifs together with centrality, which analyzes the shortest paths between nodes. The higher the centrality score, the more significant the motif. The application is colorectal cancer: the pathways found for the disease are significant pathways enriched in genes reported to be related to cancer development.
Adnan Ferdous Ashrafi, A.K.M. Iqtidar Newaz, and Rasif Ajwad Moin (2015) [8] find motifs in DNA sequences by integer matching using hash table indexing, rank the motifs, and then calculate their fitness in the DNA sequence. The main advantage is that motif finding in DNA sequences is accurate and effective.

III. PROPOSED WORK
The proposed work analyzes the genes in microarray data for Autism Spectrum Disorder; the dataset was collected from the Gene Expression Omnibus in the NCBI database. The dataset was analyzed using two unsupervised algorithms: a sequential pattern mining algorithm, PrefixSpan, and a dimensionality reduction algorithm, Principal Component Analysis (PCA). The dataset contains 21 ASD children as samples and their respective genes as attributes. The data were collected from peripheral blood leucocytes and are associated with gene expression; RNA was prepared from the venous blood of the 21 ASD children. Each algorithm was applied to the dataset to analyze the genes that are influenced by ASD. Finally, the results of the two algorithms are compared to infer which of the 100 genes are highly influenced by ASD.
The dataset format is as follows.

IV. ALGORITHM TO ANALYZE THE GENES
A frequent pattern is a set of items that occurs together repeatedly in a dataset. Whether a pattern counts as frequent depends on a user-specified threshold value. Associations and correlations are mined from the frequent items in the dataset. Association rule and frequent pattern techniques are used in bioinformatics to analyze and predict disease.
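The role of the user-specified threshold can be illustrated with a minimal Python sketch that counts how many samples contain each gene or gene pair and keeps only those that meet the threshold. The function name `frequent_itemsets`, the gene labels, and the threshold of 3 are hypothetical and are not taken from the ASD dataset.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support count} for itemsets that meet min_support."""
    counts = Counter()
    for items in transactions:
        unique = sorted(set(items))  # count each item once per transaction
        for size in range(1, max_size + 1):
            for combo in combinations(unique, size):
                counts[combo] += 1
    return {combo: n for combo, n in counts.items() if n >= min_support}


# Hypothetical toy data: each list holds the genes flagged in one sample.
samples = [
    ["G1", "G2", "G3"],
    ["G1", "G2"],
    ["G2", "G3"],
    ["G1", "G2", "G4"],
]
frequent = frequent_itemsets(samples, min_support=3)
```

With these made-up samples, only G1, G2, and the pair (G1, G2) reach the threshold; raising or lowering `min_support` changes which items survive.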

A. PrefixSpan Algorithm
The prefix-projected sequential pattern mining algorithm (PrefixSpan) identifies sequential patterns in data. PrefixSpan finds combinations of sequence patterns in the dataset and reports only those whose support is not less than the Min_support value. Applied to microarray data, PrefixSpan analyzes the genes and identifies those that form patterns for ASD. These patterns are mined using Min_support as the threshold value.
Table 2 identifies the genes that recur most often across patterns; those patterns of genes are influenced by Autism Spectrum Disorder.

Gene patterns in PrefixSpan
Fig 3 shows the genes that form patterns and that recur in a number of patterns. Only 57 of the 100 genes occur in patterns; the remaining genes do not form patterns because they do not meet the minimum support threshold.
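The prefix-projection idea behind PrefixSpan can be sketched in Python as follows. This is a minimal illustration for sequences of single items, not the implementation used in this work. The paper does not state how gene sequences are derived from the expression values, so the sequences, gene labels, and Min_support value of 2 below are made up.

```python
def prefixspan(sequences, min_support):
    """Return a list of (pattern, support) pairs for sequences of single items."""
    results = []

    def mine(prefix, projected):
        # Support of an item = number of projected sequences that contain it.
        counts = {}
        for seq in projected:
            for item in set(seq):
                counts[item] = counts.get(item, 0) + 1
        for item in sorted(counts):
            support = counts[item]
            if support < min_support:
                continue  # pruned by the Min_support threshold
            pattern = prefix + [item]
            results.append((pattern, support))
            # Project: keep what follows the first occurrence of the item.
            next_projected = []
            for seq in projected:
                if item in seq:
                    suffix = seq[seq.index(item) + 1:]
                    if suffix:
                        next_projected.append(suffix)
            mine(pattern, next_projected)

    mine([], [list(seq) for seq in sequences])
    return results


# Hypothetical gene sequences, one per sample.
gene_sequences = [
    ["G1", "G2", "G3"],
    ["G1", "G3"],
    ["G2", "G3"],
    ["G1", "G2"],
]
patterns = prefixspan(gene_sequences, min_support=2)
genes_in_patterns = sorted({gene for pattern, _ in patterns for gene in pattern})
```

Each recursive call only scans the suffixes that follow the current prefix, which is what keeps the search from enumerating every candidate sequence. Genes that never reach Min_support drop out, in the same way that the genes outside the 57 reported above do not appear in any pattern.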

B. Principal Component Analysis
Principal component analysis (PCA) is used in data analysis to reduce the dimensionality of data in order to simplify analysis. PCA uses mathematical techniques to reduce the dimension of the data: standard deviation, covariance, eigenvectors, and eigenvalues are used to analyze the correlation among variables. PCA is mainly concerned with identifying correlation in data. Values close to +1 indicate positive correlation, and values close to -1 indicate negative correlation. Values close to zero indicate poor correlation, and 0 indicates no correlation at all. Applied to the ASD microarray data, PCA finds that only 62 of the 100 genes are highly correlated with each other; those genes are influenced by ASD, and they are plotted on two principal components.
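The covariance and eigenvector steps can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the procedure used in this work: the expression values are invented, and power iteration with deflation is used only to keep the example free of external dependencies (a library routine such as `numpy.linalg.eigh` would normally be preferred).

```python
import math


def covariance_matrix(rows):
    """Sample covariance between the columns (genes) of rows (samples)."""
    n = len(rows)
    columns = list(zip(*rows))
    centered = [[x - sum(col) / n for x in col] for col in columns]
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in centered]
            for ci in centered]


def matvec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


def principal_components(cov, k=2, iterations=500):
    """Top-k (eigenvalue, eigenvector) pairs by power iteration with deflation."""
    size = len(cov)
    work = [row[:] for row in cov]
    pairs = []
    for _ in range(k):
        vec = normalize([float(i + 1) for i in range(size)])  # arbitrary start
        for _ in range(iterations):
            vec = normalize(matvec(work, vec))
        eigenvalue = sum(a * b for a, b in zip(vec, matvec(work, vec)))
        pairs.append((eigenvalue, vec))
        # Deflate: subtract the component just found before the next pass.
        work = [[work[i][j] - eigenvalue * vec[i] * vec[j] for j in range(size)]
                for i in range(size)]
    return pairs


# Invented expression values: rows are samples, columns are genes A, B, C, D.
expression = [
    [1.0, 2.0, 4.0, 1.0],
    [2.0, 4.0, 3.0, -1.0],
    [3.0, 6.0, 2.0, -1.0],
    [4.0, 8.0, 1.0, 1.0],
]
components = principal_components(covariance_matrix(expression), k=2)
pc1 = components[0][1]  # one loading per gene on the first component
```

In this toy data, genes A and B rise together while C falls, so A and B receive loadings of the same sign on the first component and C the opposite sign; D varies independently of the other three and has a loading near zero. Pairing each gene's loadings on the first and second components gives one point per gene for a two-component plot.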

Positively and Negatively Correlated Genes
The genes are divided into positively correlated and negatively correlated genes. Of the 100 genes, 17 are positively correlated, 45 are negatively correlated, and 38 are poorly correlated. This is shown in Fig 3. The negatively correlated genes match those found by the pattern mining algorithm.
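The grouping and the comparison with the pattern mining result can be sketched in Python as follows. The correlation values, gene labels, pattern-gene set, and the cutoff of 0.5 are all hypothetical; the paper does not state the cutoff that separates poorly correlated genes from the rest.

```python
def classify_genes(correlations, cutoff=0.5):
    """Group genes by the sign and size of a correlation value in [-1, 1]."""
    groups = {"positive": [], "negative": [], "poor": []}
    for gene, value in correlations.items():
        if value >= cutoff:
            groups["positive"].append(gene)
        elif value <= -cutoff:
            groups["negative"].append(gene)
        else:
            groups["poor"].append(gene)
    return groups


# Hypothetical correlation values for five genes.
correlations = {"G1": 0.91, "G2": -0.84, "G3": 0.12, "G4": -0.63, "G5": -0.05}
groups = classify_genes(correlations)

# Hypothetical set of genes that appeared in PrefixSpan patterns.
pattern_genes = {"G2", "G4", "G5"}
matched = sorted(set(groups["negative"]) & pattern_genes)
```

Here `matched` holds the negatively correlated genes that also occur in mined patterns, which is the kind of overlap the comparison between the two algorithms looks for.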

VI. CONCLUSION
The Autism Spectrum Disorder microarray dataset is analyzed to identify the genes that are influenced by ASD, using a sequential pattern mining algorithm, PrefixSpan, and a dimensionality reduction algorithm, Principal Component Analysis (PCA), which finds the positively and negatively correlated genes. The comparison table visualizes the genes that are highly influenced by ASD.
Future Scope
In this paper only 100 genes are analyzed, but the dataset contains more genes. In future work, more genes can be analyzed with these algorithms to identify those that are highly influenced by ASD.
VII.