Implementation of SMOTE and Support Vector Machine for Classifying Imbalanced Arginine Methylation Data

Favorisen Rosyking Lumbanraja
Ester Caroline Lumban Gaol
Dewi Asiah Shofiana
Akmal Junaidi

Abstract

Imbalanced data is one of the crucial problems in machine learning and data mining: it can lead to low accuracy on minority classes and keeps a classification method from performing at its best. The arginine methylation dataset, for example, is heavily imbalanced. Methylation is a post-translational modification that occurs on arginine residues of proteins and affects signal transduction and RNA binding in the cytoplasm. Therefore, it is essential to handle imbalanced data for classification. Synthetic Minority Oversampling Technique (SMOTE) is an algorithm that addresses class imbalance in classification using the concept of k-nearest neighbors. Support Vector Machine (SVM) is a supervised learning method that separates classes with a hyperplane chosen to maximize the margin. In this research, the arginine methylation dataset is divided into three experimental sets: training data, testing data, and independent data. Data processing goes through a series of steps: data pre-processing (removing redundant data), feature extraction (generating 159 feature dimensions), SMOTE and SVM modeling, and classification testing using 10-fold cross-validation and a confusion matrix. On the training data, the RBF kernel reaches an accuracy of 100%, whereas on the testing data the linear kernel gives a low accuracy of 65.90%. On the independent data, the linear kernel achieves a decent accuracy of 98.50%.
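
The k-nearest-neighbor idea behind SMOTE mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `smote`, its parameters, and the toy two-dimensional data are assumptions made for illustration (the study itself works with 159 feature dimensions), and base samples are drawn at random rather than visited in turn as in the original formulation by Chawla et al.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples
    and their k nearest minority-class neighbors (hypothetical helper)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbors than other samples
    synthetic = []
    for _ in range(n_synthetic):
        # Pick a base minority sample at random (a simplification).
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the remaining minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbor = rng.choice(others[:k])
        # The new point lies on the segment between the base and the neighbor.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Toy data (assumed): three minority samples with two features each.
minority = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0]]
new_points = smote(minority, n_synthetic=4, k=2)
balanced_minority = minority + new_points
```

Because every synthetic point is a convex combination of two existing minority samples, it stays inside the region those samples already occupy. In a workflow like the one described above, oversampling would normally be applied to the training data only, so that the testing and independent sets remain untouched.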

Article Details

How to Cite
Lumbanraja, F. R., Gaol, E. C. L., Shofiana, D. A., & Junaidi, A. (2024). Implementasi SMOTE dan Support Vector Machine Pada Klasifikasi Data Tidak Seimbang Metilasi Arginin. Jurnal Pepadun, 5(1), 27–37. https://doi.org/10.23960/pepadun.v5i1.209
