COMPARATIVE ANALYSIS OF CLUSTER CONCENTRIC CIRCLE BASED UNDER SAMPLING OVER LOW VERSUS HIGH DIMENSIONAL IMBALANCED DATASETS

Srividhya S; R. Mallika

doi:10.26483/ijarcs.v8i8.4783

Published: Oct 20, 2017

Keywords:

Classification, C3BUS, Imbalanced dataset, High dimensionality, under sampling, supervised learning

Srividhya S

Bharathiar University

R. Mallika

Abstract

An imbalanced dataset adversely affects a supervised learning model. Most real-world datasets are imbalanced and often high dimensional. Existing classification methods tend to perform extremely well on the majority class and give the least importance to the minority class. Most of the solutions proposed for imbalanced datasets are not suited to high-dimensional imbalanced datasets. This paper compares the performance of an existing balancing method, cluster concentric circle based under sampling (C3BUS), on low-dimensional versus high-dimensional imbalanced datasets. The work shows that C3BUS works quite well on low-dimensional imbalanced datasets compared with high-dimensional ones, and demonstrates that class imbalance and high dimensionality are two of the main issues in the supervised learning process.
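The majority-class bias described in the abstract can be illustrated with a small sketch. It is not C3BUS, whose procedure is not given on this page. It uses plain random under sampling as a stand-in balancing step, and the toy data, the function names (`accuracy`, `recall`, `random_undersample`), and the 95:5 class ratio are all assumptions made for illustration.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of true `positive` samples that were predicted as `positive`."""
    relevant = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in relevant) / len(relevant)


def random_undersample(samples, labels, majority_label, seed=0):
    """Randomly drop majority samples until both classes have the same size.

    Illustrative stand-in only: C3BUS selects majority samples differently.
    """
    rng = random.Random(seed)
    majority = [i for i, y in enumerate(labels) if y == majority_label]
    minority = [i for i, y in enumerate(labels) if y != majority_label]
    n_keep = min(len(majority), len(minority))
    kept = sorted(rng.sample(majority, n_keep) + minority)
    return [samples[i] for i in kept], [labels[i] for i in kept]


# Hypothetical toy dataset: 95 majority samples (label 0), 5 minority (label 1).
labels = [0] * 95 + [1] * 5
samples = [[float(i)] for i in range(100)]

# A model that always predicts the majority class looks accurate
# while never detecting a single minority sample.
always_majority = [0] * len(labels)
acc = accuracy(labels, always_majority)
rec = recall(labels, always_majority, positive=1)

# Balancing the classes before training removes the incentive to ignore
# the minority class, at the cost of discarding majority samples.
balanced_x, balanced_y = random_undersample(samples, labels, majority_label=0)
```

Here `acc` is high even though `rec` for the minority class is zero, which is why accuracy alone is a poor measure on imbalanced data. Random under sampling discards majority samples blindly, whereas cluster-based methods such as C3BUS aim to choose which majority samples to keep.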

Issue

Vol. 8 No. 8 (2017): September-October

Section

Articles

COPYRIGHT

Submission of a manuscript implies: that the work described has not been published before, that it is not under consideration for publication elsewhere; that if and when the manuscript is accepted for publication, the authors agree to automatic transfer of the copyright to the publisher.

Authors who publish with this journal agree to the following terms:

Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this journal.
Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this journal.
Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.
The journal allows the author(s) to retain publishing rights without restrictions.
The journal allows the author(s) to hold the copyright without restrictions.
