Implementation of Ensemble Method on DNA Data Using Various Cross Validation Techniques

Published in 3C Tecnología, Volume 11, Issue 2 (Ed. 42)

Abstract

Feature selection has drawn considerable research interest in recent years because datasets increasingly contain hundreds or thousands of features. Typically, not all of these features carry useful information, and noisy or irrelevant columns can confound learning algorithms and degrade model performance. To address this issue, various feature selection methods have been developed to evaluate high-dimensional datasets and identify subsets of relevant features. However, the results of individual feature selection algorithms are often biased by the data they are applied to. Ensemble approaches have therefore emerged as an alternative that combines the strengths of individual feature selection algorithms and compensates for their weaknesses. This research aims to identify the key concepts and relationships involved in aggregating feature selection methods for high-dimensional datasets. The proposed approach is tested through a cross-validation implementation that combines several Python packages providing the required feature selection functionality. Its performance was demonstrated by identifying relevant features in human, chimpanzee, and dog DNA datasets.
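The idea of aggregating several feature selection methods under cross-validation can be illustrated with a minimal sketch. It is not the authors' implementation: the two scorers (`variance_score`, `class_gap_score`), the mean-rank aggregation, the function names, and the toy data are all assumptions made for illustration, and the sketch uses only the Python standard library rather than the packages referred to in the abstract. In a DNA setting, the columns of `X` would be numeric features derived from the sequences, which is also an assumption here.

```python
import random
from statistics import mean, pvariance

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k roughly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def variance_score(X, y):
    """Hypothetical unsupervised scorer: variance of each feature column."""
    return [pvariance(col) for col in zip(*X)]

def class_gap_score(X, y):
    """Hypothetical supervised scorer: |mean(class 0) - mean(class 1)| per feature."""
    scores = []
    for col in zip(*X):
        a = [v for v, label in zip(col, y) if label == 0]
        b = [v for v, label in zip(col, y) if label == 1]
        scores.append(abs(mean(a) - mean(b)) if a and b else 0.0)
    return scores

def to_ranks(scores):
    """Convert scores to ranks, where rank 0 is the highest-scoring feature."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    ranks = [0] * len(scores)
    for rank, j in enumerate(order):
        ranks[j] = rank
    return ranks

def ensemble_select(X, y, scorers, k=5, top_n=2, seed=0):
    """Sum feature ranks over every scorer and every training fold,
    then return the indices of the top_n features with the lowest total."""
    n_features = len(X[0])
    rank_sums = [0] * n_features
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        # Score features on the training part of the split only.
        X_train = [row for i, row in enumerate(X) if i not in held]
        y_train = [lab for i, lab in enumerate(y) if i not in held]
        for scorer in scorers:
            for j, r in enumerate(to_ranks(scorer(X_train, y_train))):
                rank_sums[j] += r
    return sorted(range(n_features), key=lambda j: rank_sums[j])[:top_n]

# Toy data (assumed): features 0 and 3 depend on the label,
# feature 1 is constant, feature 2 is low-amplitude noise.
rng = random.Random(42)
y = [i % 2 for i in range(20)]
X = [[10 * label + rng.random(), 1.0, 0.1 * rng.random(), 5 * label + rng.random()]
     for label in y]

selected = ensemble_select(X, y, [variance_score, class_gap_score], k=5, top_n=2)
print(selected)
```

Aggregating ranks rather than raw scores keeps scorers with different scales comparable, and repeating the scoring on each training fold reduces the influence of any single split of the data on the final subset.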
