COMPARATIVE ANALYSIS OF CLUSTER CONCENTRIC CIRCLE BASED UNDER SAMPLING OVER LOW VERSUS HIGH DIMENSIONAL IMBALANCED DATASETS

Srividhya S, R. Mallika

Abstract


Imbalanced datasets degrade the performance of supervised learning models. Most real-world datasets are imbalanced, and many are also high dimensional. Existing classification methods tend to perform very well on the majority class while giving little weight to the minority class. Most solutions proposed for imbalanced datasets do not carry over to high dimensional imbalanced datasets. This paper compares the performance of an existing balancing method, cluster concentric circle based under sampling (C3BUS), on low dimensional versus high dimensional imbalanced datasets. The results show that C3BUS works quite well on low dimensional imbalanced datasets compared with high dimensional ones, and that class imbalance and high dimensionality are two of the main issues in the supervised learning process.

Keywords


Classification, C3BUS, Imbalanced dataset, High dimensionality, Under sampling, Supervised learning






DOI: https://doi.org/10.26483/ijarcs.v8i8.4783





Copyright (c) 2017 International Journal of Advanced Research in Computer Science