RFKNN: ROUGH-FUZZY KNN FOR BIG DATA CLASSIFICATION


Mohamed A. Mahfouz

Abstract

The k-nearest neighbors (kNN) method is a lazy-learning approach to classification and regression that has been applied successfully in several domains. It is simple and directly applicable to multi-class problems; however, it suffers from high complexity in both memory and computation. Several studies have tried to scale kNN to very large datasets using crisp partitioning. In this paper, we propose to integrate the principles of rough sets and fuzzy sets into a clustering algorithm that separates the whole dataset into several parts, on each of which kNN classification is then conducted. The concepts of a crisp lower bound and a fuzzy boundary of a cluster, applied in the proposed algorithm, allow accurate selection of the set of data points involved in classifying an unseen data point. The selected points are a mix of core and border points of the clusters created in the training phase. Experimental results on standard datasets show that the proposed kNN classification is more effective than related recent work, with only a slight increase in classification time.
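The idea sketched in the abstract can be illustrated in a few lines of Python. This is a hypothetical simplification, not the authors' exact algorithm: cluster centers are assumed given, a point whose nearest center is clearly closer than the second nearest goes into that cluster's crisp lower approximation (its "core"), an ambiguous point goes into the fuzzy boundary of both nearby clusters, and a query is classified by kNN over the core plus boundary points of its nearest cluster. The `ratio` threshold and both function names are illustrative assumptions.

```python
import math
from collections import Counter

def partition(points, centers, ratio=0.8):
    """Split labeled points into crisp cores and fuzzy boundaries.
    `ratio` (an assumed threshold) decides ambiguity: if the nearest
    center is not clearly closer than the second nearest, the point
    is a border point shared by both clusters."""
    core = {i: [] for i in range(len(centers))}
    border = {i: [] for i in range(len(centers))}
    for p, label in points:
        order = sorted(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        d0 = math.dist(p, centers[order[0]])
        d1 = math.dist(p, centers[order[1]])
        if d1 == 0 or d0 / d1 <= ratio:       # clearly nearest -> crisp core
            core[order[0]].append((p, label))
        else:                                 # ambiguous -> fuzzy boundary of both
            border[order[0]].append((p, label))
            border[order[1]].append((p, label))
    return core, border

def rf_knn(query, centers, core, border, k=3):
    """Classify `query` by majority vote among the k nearest of the
    core and border points of its nearest cluster only, instead of
    searching the whole dataset."""
    c = min(range(len(centers)), key=lambda i: math.dist(query, centers[i]))
    candidates = core[c] + border[c]
    nearest = sorted(candidates, key=lambda t: math.dist(query, t[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

Restricting the candidate set to one cluster's core and boundary is what reduces the per-query cost relative to plain kNN, while the shared boundary points keep queries near cluster edges from losing relevant neighbors.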


Author Biography

Mohamed A. Mahfouz, Ph.D., Faculty of Engineering, Alexandria University, Egypt

Mohamed Mahfouz is a guest assistant professor in the Computer & Communication Engineering program, SSP, Faculty of Engineering, Alexandria University. He received the B.Sc., M.Sc., and Ph.D. degrees in Computer and Systems Engineering from the University of Alexandria, Egypt, in 1989, 1996, and 2009, respectively. He has published several papers in the areas of bioinformatics and machine learning. He is also a recognized reviewer for Elsevier and has reviewed papers for other ranked journals.

References

W.-J. Hwang and K.-W. Wen, "Fast kNN classification algorithm based on partial distance search," Electronics Letters, vol. 34, pp. 2062-2063, 1998.

Y. Song, J. Liang, J. Lu, and X. Zhao, "An efficient instance selection algorithm for k nearest neighbor regression," Neurocomputing, vol. 251, pp. 26-34, 2017.

R. S. Michalski, J. G. Carbonell, and T. M. Mitchell, Machine learning: An artificial intelligence approach: Springer Science & Business Media, 2013.

S. A. Medjahed, T. A. Saadi, and A. Benyettou, "Breast Cancer Diagnosis by using k-Nearest Neighbor with Different Distances and Classification Rules," International Journal of Computer Applications, vol. 62, 2013.

G. Bhattacharya, K. Ghosh, and A. S. Chowdhury, "An affinity-based new local distance function and similarity measure for kNN algorithm," Pattern Recognition Letters, vol. 33, pp. 356-363, 2012.

M. J. Islam, Q. J. Wu, M. Ahmadi, and M. A. Sid-Ahmed, "Investigating the performance of naive-bayes classifiers and k-nearest neighbor classifiers," in Convergence Information Technology, 2007. International Conference on, 2007, pp. 1541-1546.

T. İnkaya, S. Kayalıgil, and N. E. Özdemirel, "An adaptive neighbourhood construction algorithm based on density and connectivity," Pattern Recognition Letters, vol. 52, pp. 17-24, 2015.

S. Zhang, X. Li, M. Zong, X. Zhu, and D. Cheng, "Learning k for knn classification," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 8, p. 43, 2017.

I. Mani and I. Zhang, "kNN approach to unbalanced data distributions: a case study involving information extraction," in Proceedings of workshop on learning from imbalanced datasets, 2003.

V. Ganganwar, "An overview of classification algorithms for imbalanced datasets," International Journal of Emerging Technology and Advanced Engineering, vol. 2, pp. 42-47, 2012.

M.-L. Hou, S.-L. Wang, X.-L. Li, and Y.-K. Lei, "Neighborhood rough set reduction-based gene selection and prioritization for gene expression profile analysis and molecular cancer classification," BioMed Research International, vol. 2010, 2010.

O. Okun and H. Priisalu, "Dataset complexity in gene expression based cancer classification using ensembles of k-nearest neighbors," Artificial Intelligence in Medicine, vol. 45, pp. 151-162, 2009.

S. D. Bay, "Nearest neighbor classification from multiple feature subsets," Intelligent Data Analysis, vol. 3, pp. 191-209, 1999.

X. Wu, C. Zhang, and S. Zhang, "Efficient mining of both positive and negative association rules," ACM Transactions on Information Systems (TOIS), vol. 22, pp. 381-405, 2004.

X. Zhu, L. Zhang, and Z. Huang, "A sparse embedding and least variance encoding approach to hashing," IEEE Transactions on Image Processing, vol. 23, pp. 3737-3750, 2014.

X. Zhu, S. Zhang, Z. Jin, Z. Zhang, and Z. Xu, "Missing value estimation for mixed-attribute data sets," IEEE Transactions on Knowledge and Data Engineering, vol. 23, pp. 110-121, 2011.

Z. Deng, X. Zhu, D. Cheng, M. Zong, and S. Zhang, "Efficient kNN classification algorithm for big data," Neurocomputing, vol. 195, pp. 143-148, 2016.

Z. Pawlak, Rough Sets: Theoretical Aspects of Reasoning about Data. Dordrecht: Kluwer Academic Publishers, 1991.

L. A. Zadeh, "Fuzzy sets," in Fuzzy Sets, Fuzzy Logic, And Fuzzy Systems: Selected Papers by Lotfi A Zadeh, ed: World Scientific, 1996, pp. 394-432.

A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Englewood Cliffs, NJ: Prentice-Hall, 1988.

R. J. Hathaway and J. C. Bezdek, "Extending fuzzy and probabilistic clustering to very large data sets," Computational Statistics & Data Analysis, vol. 51, pp. 215-234, 2006.

S. Z. Selim and M. A. Ismail, "Soft clustering of multidimensional data: a semi-fuzzy approach," Pattern Recognition, vol. 17, pp. 559-568, 1984.

K. Bache and M. Lichman, UCI Machine Learning Repository, 2013.

C.-C. Chang and C.-J. Lin, "LIBSVM: a library for support vector machines," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2, p. 27, 2011.

G. Song, J. Rochas, F. Huet, and F. Magoules, "Solutions for processing k nearest neighbor joins for massive data on mapreduce," in Parallel, Distributed and Network-Based Processing (PDP), 2015 23rd Euromicro International Conference on, 2015, pp. 279-287.