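Implementation of SMOTE and Support Vector Machine for the Classification of Imbalanced Arginine Methylation Data (Implementasi SMOTE dan Support Vector Machine Pada Klasifikasi Data Tidak Seimbang Metilasi Arginin)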

Favorisen Rosyking Lumbanraja; Ester Caroline Lumban Gaol; Dewi Asiah Shofiana; Akmal Junaidi

doi:10.23960/pepadun.v5i1.209

Published: Apr 15, 2024

Keywords:

Imbalanced Data, Methylation, Post-Translational Modification, SMOTE, Support Vector Machine

Favorisen Rosyking Lumbanraja

Department of Computer Science, Faculty of Mathematics and Natural Sciences, Universitas Lampung

Ester Caroline Lumban Gaol

Department of Computer Science, Faculty of Mathematics and Natural Sciences, Universitas Lampung

Dewi Asiah Shofiana

Department of Computer Science, Faculty of Mathematics and Natural Sciences, Universitas Lampung

Akmal Junaidi

Department of Computer Science, Faculty of Mathematics and Natural Sciences, Universitas Lampung

Abstract

Imbalanced data is a crucial problem in machine learning and data mining: it can lead to low accuracy on minority classes and prevent a classification method from performing optimally. The arginine methylation dataset, for example, is heavily imbalanced. Methylation is a post-translational modification that occurs on arginine residues in proteins and affects signal transduction and RNA binding in the cytoplasm. Handling the imbalance is therefore essential for classification. Synthetic Minority Oversampling Technique (SMOTE) is an algorithm that addresses imbalanced data in classification using the concept of k-nearest neighbors. Support Vector Machine (SVM) is a supervised learning method that separates classes with a hyperplane chosen to maximize the margin. In this research, the arginine methylation dataset is divided into three experimental sets: training data, testing data, and independent data. Data processing goes through a series of steps: data pre-processing (removing redundant data), feature extraction (generating 159 feature dimensions), SMOTE and SVM modeling, and classification testing using 10-fold cross-validation and a confusion matrix. Accuracy on the training data is 100% with the RBF kernel, whereas the testing data yields a low accuracy of 65.90% with the linear kernel. The independent data reaches an accuracy of 98.50% with the linear kernel.
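The k-nearest-neighbor idea behind SMOTE can be illustrated with a minimal sketch. It is not the authors' implementation (the abstract does not say which software was used); the function name `smote`, its parameters, and the toy data are assumptions made for illustration only. Each synthetic sample is placed at a random position on the line segment between a minority sample and one of its k nearest minority-class neighbors.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples
    and their k nearest minority-class neighbours (hypothetical helper)."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot use more neighbours than exist
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to `base`.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy data (hypothetical): 3 minority points to be balanced against 8 majority points.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1]]
majority_count = 8
new_points = smote(minority, majority_count - len(minority), k=2)
balanced_minority = minority + new_points
```

In a pipeline like the one described, oversampling of this kind would normally be applied to the training portion only, so that the testing and independent sets keep their original class distribution; whether the study did so is not stated in the abstract.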

How to Cite

Lumbanraja, F. R., Gaol, E. C. L., Shofiana, D. A., & Junaidi, A. (2024). Implementasi SMOTE dan Support Vector Machine Pada Klasifikasi Data Tidak Seimbang Metilasi Arginin. Jurnal Pepadun, 5(1), 27–37. https://doi.org/10.23960/pepadun.v5i1.209

Issue

Vol. 5 No. 1 (2024): April

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
