IMPLEMENTATION OF ENSEMBLE METHOD ON DNA DATA USING VARIOUS CROSS VALIDATION TECHNIQUES
B. U. Bawankar
G.H. Raisoni University, Amaravati, India.
Kotadi Chinnaiah
G.H. Raisoni University, Amaravati, India.
Reception: 10/09/2022 Acceptance: 25/09/2022 Publication: 29/12/2022
Suggested citation:
Bawankar, B. U., and Chinnaiah, K. (2022). Implementation of ensemble method on DNA data using various cross validation techniques. 3C Tecnología. Glosas de innovación aplicadas a la pyme, 11(2), 59-69. https://doi.org/10.17993/3ctecno.2022.v11n2e42.59-69
ABSTRACT
Due to the growing size of datasets, which may contain hundreds or thousands of features, feature selection has drawn the interest of many scholars in recent years. Usually, not all columns carry informative values, so machine learning models may perform poorly because noisy or unnecessary columns confound the algorithms. To address this issue, various feature selection methods have been developed to evaluate high-dimensional datasets and identify subsets of pertinent features. Feature selection algorithms, however, are frequently biased by the data. As a result, ensemble approaches have emerged as an alternative that incorporates the benefits of single feature selection algorithms and makes up for their drawbacks. In order to handle feature selection on datasets with high dimensionality, this research aims to grasp the key ideas and links in the process of aggregating feature selection methods. The proposed idea is tested by creating a cross-validation implementation that combines a number of Python packages with functionality to support the feature selection techniques. The performance of the implementation is demonstrated by identifying pertinent features in the human, chimpanzee, and dog DNA datasets.
KEYWORDS
Cross-validation, Ensemble methods, Feature selection.
1. INTRODUCTION
In recent years, datasets with a large number of attributes have become more common in several fields. Microarray classification is a prime illustration: improvements in DNA microarray technology have produced numerous datasets of this type. In the majority of these datasets, the number of instances is very small relative to the number of features, which often run into thousands of genes. However, most of the genes in these datasets do not represent information that is helpful to a machine learning process. In order to classify microarray data efficiently, a pre-processing stage is therefore required. This article explains how to do so by choosing a representative subset of genes from the original set of genes (Mera-Gaona, López, Vargas-Canas, and Neumann, 2021)[16]. The individual success of the ensemble's base learners and the independence of the base learners' results, achieved through low error and high diversity, are the two major factors that determine how well an ensemble performs. By utilising base learners of the
same or different types, diverse base learners can be built. When using the same type of base learners,
diversity is produced by giving each base learner in the ensemble a different training set. Different
training data sets can be created using a variety of techniques, including bagging, boosting, random
subspaces, random forests, and rotation forests. In order to create a superior composite global model
with more precise and trustworthy estimates or conclusions than can be produced by utilising a single
model, an ensemble methodology combines a group of models, each of which addresses the same
original problem. The fact that different classifier types have distinct inductive biases is one of the key
reasons why ensemble methods are so successful (Gopika and Azhagusundari, 2014)[9]. Finding
ways to enhance feature selection on datasets with high dimensionality and few examples is the major
goal of this work. Additionally, cross-validation is used in demonstrating ensemble methods that combine the benefits of several feature selection algorithms, avoid their biases, and make up for their shortcomings (Mera-Gaona et al., 2021)[16].
2. ENSEMBLE METHODS
Ensemble classification is founded on the idea that several experts can provide more accurate judgments than a single expert. Ensemble modelling combines a collection of classifiers to produce a single composite model with higher accuracy. According to research, predictions from a composite model provide better outcomes than predictions from a single model. Over the past few decades, research on ensemble techniques has gained popularity. A number of experimental studies carried out by machine learning researchers show that combining the outputs of many classifiers minimises generalisation error. The ensemble approaches are described in this section (Pandey and Taruna, 2014)[10-11].
(1) Bagging
The bagging technique is used to reduce variance; its goal is to divide the dataset into several training subsets that are randomly chosen with replacement (Singh and Pal, 2020)[10]. Bagging is based on the bootstrap sampling approach. A distinct set of bootstrap samples is produced in each iteration of the procedure in order to build a unique classifier. During the sampling phase of the bootstrap approach, data items are chosen at random with replacement, meaning that some instances may be repeated and others omitted from the original dataset. The next stage in the bagging process is combining all of the classifiers built in the previous phase. To arrive at a final prediction, bagging combines the outputs of the classifiers through a voting process [11-12].
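To make the procedure concrete, the following is a minimal sketch of bagging using scikit-learn's BaggingClassifier on synthetic data; the base estimator, data, and parameter values are illustrative assumptions rather than the setup used in this paper, and older scikit-learn releases name the first argument base_estimator instead of estimator.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative synthetic data standing in for a real feature matrix and labels.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the 50 trees is fit on a bootstrap sample drawn with replacement;
# the final prediction is obtained by voting across the trees.
bag = BaggingClassifier(estimator=DecisionTreeClassifier(),
                        n_estimators=50, bootstrap=True, random_state=0)
bag.fit(X_train, y_train)
print("Bagging accuracy:", bag.score(X_test, y_test))
```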
(2) Boosting
Another crucial ensemble method is the boosting classifier. It is used to develop a collection of classifiers. In the boosting approach, classifiers are trained serially by fitting them to the data and then assessing their mistakes (Singh and Pal, 2020)[10]. Boosting improves the performance of weak classifiers to a strong level. It creates sequential learning classifiers by reweighting the data instances. All the instances are initially given equal weights. Each time a learning phase is completed, a new hypothesis is learned and the examples are reweighted such that instances that were properly classified during that phase receive a lower weight, so the system can focus on the instances that were not. Instances that were incorrectly classified are emphasised so that they can be correctly classified in the following learning stage. This procedure continues until the final classifier is built. To arrive at the final prediction, the outputs of the classifiers are merged using majority voting. AdaBoost is a well-known generalisation of the boosting method [12].
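As a brief, hedged illustration of this sequential reweighting, the sketch below uses scikit-learn's AdaBoostClassifier on synthetic data; the dataset and parameter choices are assumptions made only for demonstration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Weak learners (shallow trees by default) are trained one after another;
# misclassified instances receive larger weights in the next round.
boost = AdaBoostClassifier(n_estimators=100, random_state=0)
boost.fit(X_train, y_train)
print("AdaBoost accuracy:", boost.score(X_test, y_test))
```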
(3) Random Subspaces
The approach comes in two different forms. In the first form, each base learner is trained on a distinct feature subspace of the initial training data set. In the second form, only decision trees may be utilised as the base learner (Gopika and Azhagusundari, 2014)[9].
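A minimal sketch of the first form (random feature subspaces) can be expressed with scikit-learn's BaggingClassifier by disabling instance bootstrapping and sampling features instead; the fraction of features per learner is an illustrative assumption, not a value taken from the paper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Every base tree sees all instances but only a random half of the features,
# which produces diversity among the base learners.
subspaces = BaggingClassifier(estimator=DecisionTreeClassifier(),
                              n_estimators=50,
                              bootstrap=False,      # keep all instances
                              max_features=0.5,     # random feature subspace
                              random_state=0)
subspaces.fit(X, y)
```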
(4) Random Forest
Breiman proposed Random Forest. It can be formulated as bagging plus the second kind of random subspaces (Breiman)[12]; the bagging and random subspace methods are combined to induce the trees. It differs from bagging in that each model is a random tree rather than a single model, yet each tree is still created from a bootstrap sample of size N drawn from the training set. Each node is split using a further random step: instead of examining all potential splits, a limited subset of features is randomly picked, and the optimum split is determined from this subset. Across all trees, the majority vote determines the final categorisation [11].
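A minimal sketch with scikit-learn's RandomForestClassifier follows; the synthetic data and hyper-parameters are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each tree is grown on a bootstrap sample; at every node only a random
# subset of features (sqrt of the total here) is examined for the best split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0)
forest.fit(X_train, y_train)
print("Random forest accuracy:", forest.score(X_test, y_test))
```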
(5) Rotation Forest
Rotation Forest is a recently introduced ensemble approach built on Principal Component Analysis (PCA) and decision trees. To create a training set for the base classifier using a K-axis rotation of the feature subsets, the attribute set F is randomly divided into K subsets, and PCA is then performed separately on each subset. By keeping all of the principal components, Rotation Forest retains all of the information. The base classifier for Rotation Forest is the decision tree (Pandey and Taruna, 2014)[11].
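Scikit-learn does not ship a Rotation Forest estimator, so the following is only a rough sketch of how one rotated base tree could be built: the features are split into K random subsets, PCA is fitted on each subset keeping all components, and the block-wise loadings form the rotation applied before training a decision tree. A full Rotation Forest would repeat this with fresh rotations (and bootstrap samples) for every tree; all data, names, and parameters here are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
rng = np.random.default_rng(0)

# Randomly partition the features into K subsets and run PCA on each one,
# keeping every component so that no information is discarded.
K = 4
order = rng.permutation(X.shape[1])
rotation = np.zeros((X.shape[1], X.shape[1]))
for subset in np.array_split(order, K):
    pca = PCA(n_components=len(subset)).fit(X[:, subset])
    rotation[np.ix_(subset, subset)] = pca.components_.T  # block of loadings

# One base learner of the ensemble: a decision tree trained on rotated data.
tree = DecisionTreeClassifier(random_state=0).fit(X @ rotation, y)
print("Training accuracy of one rotated tree:", tree.score(X @ rotation, y))
```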
3. CROSS VALIDATION TECHNIQUES
A statistical technique called cross-validation determines how well a trained model will perform on
unobserved data. By training the model on a subset of the input data and testing it on a different
subset, the model's effectiveness is confirmed. Cross-validation assists in building a generalised model. Since modelling is an iterative process, cross-validation is helpful for both performance estimation and model selection.
Cross-validation involves the following three steps:
i. Split the dataset into two sections: a training section and a testing section.
ii. Use the training dataset to train the model.
iii. Use the testing set to gauge the model's effectiveness. Check for problems if the model does not perform well on the testing set.
If a model can predict accurately for a variety of input data and does well on unknown data, it is stable and consistent. Cross-validation aids in evaluating the stability of machine learning models.
The dataset has to be divided into three separate sections for training and testing the model:
• Training Data: Using the training data, the model is trained to discover the dataset's hidden characteristics and patterns. The model continually assesses the data to better understand its behaviour and then adjusts itself to achieve its goal. Basically, it is employed to fit the models.
• Validation Data: This is used to confirm that the model's training results were accurate. It aids in adjusting the hyper-parameters and settings of the model appropriately. The prediction error for model selection is estimated using the validation data. Validation data helps prevent over-fitting models.
• Test Data: Following training, the test data confirms that the trained model is capable of making precise predictions. It is used to evaluate the generalisation error of the final model chosen (Hulu and Sihombing, 2020)[1], (Jung, 2015)[7-8], (Wu)[13-14].
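As a brief, hedged illustration of this three-way split, the sketch below first carves off a held-out test set and then divides the remainder into training and validation sets; the 60/20/20 proportions and synthetic data are assumptions made only for demonstration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# 20% is reserved as the final test set ...
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)
# ... and a quarter of the remainder (20% overall) becomes the validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```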
This paper discusses eight alternative cross-validation approaches, each with advantages and disadvantages, as stated below.
(1) Leave p out cross-validation
Leave p-out cross-validation is an exhaustive cross-validation strategy that uses p observations as validation data while utilising the remaining data to train the model. This is repeated over all possible ways of splitting the original sample into a validation set of p observations and a training set. Leave-pair-out cross-validation, a variation of leave p-out with p=2, has been suggested as a way to estimate the area under the ROC curve of a binary classifier in a virtually unbiased manner (Kumar, 2020)[14].
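A minimal sketch with scikit-learn's LeavePOut follows; the classifier and the deliberately small synthetic dataset are assumptions, chosen because the number of splits grows combinatorially with the sample size.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeavePOut, cross_val_score

# A deliberately small sample: with p=2 there are already n*(n-1)/2 splits.
X, y = make_classification(n_samples=20, n_features=5, random_state=0)
lpo = LeavePOut(p=2)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=lpo)
print("Number of splits:", len(scores), "mean accuracy:", scores.mean())
```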
(2) Leave one out cross-validation
Leave-one-out cross-validation is another exhaustive cross-validation method; it is the special case of leave p-out cross-validation with p=1. From a dataset of n rows, the first row is chosen for validation and the remaining n-1 rows are utilised to train the model. For the following iteration, the second row is chosen for validation and the remainder is used to train the model. The procedure is repeated in this way for n iterations or phases. Cross-validation techniques such as leave p-out and leave-one-out, which learn and test on every conceivable split, are known as exhaustive cross-validation techniques. They share the advantages of being straightforward, understandable, and simple to use, and the disadvantages that the model may exhibit a little bias and that a lot of computing time is needed [13-14].
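The following is a minimal sketch using scikit-learn's LeaveOneOut; the classifier and synthetic data are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_classification(n_samples=100, n_features=10, random_state=0)

# One model is trained per row: n fits in total, each validated on a single row.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=LeaveOneOut())
print("Mean accuracy over", len(scores), "leave-one-out fits:", scores.mean())
```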
(3) Holdout cross-validation
In holdout cross-validation, the dataset is randomly divided into training and validation data. In general, a larger share of the data is assigned to training than to testing. The model is created using the training data, and the validation data is used to assess the model's effectiveness; the model becomes better as more data are used to train it. The holdout approach therefore sets aside a sizable amount of data for training. Its advantages are that it is straightforward, understandable, and simple to use; its disadvantages are that it is not suitable for an imbalanced dataset and that a lot of data is not used to train the model (Raschka, 2020)[5].
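A minimal holdout sketch with scikit-learn follows; the 70/30 split, classifier, and data are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# A single random split: 70% of the rows train the model, 30% are held out.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Holdout accuracy:", model.score(X_test, y_test))
```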
(4) Repeated random sub-sampling validation
In repeated random sub-sampling validation, commonly known as Monte Carlo cross-validation, the dataset is randomly divided into training and validation data. Unlike k-fold cross-validation, this approach separates the dataset into random splits rather than fixed groups or folds. The number of iterations is not a set quantity; it is determined by the analysis, and the results are then averaged over the splits. The advantage of this validation is that the number of iterations or divisions is independent of the fraction used for the train and validation splits; the disadvantages are that some samples may never be selected for either training or validation, and that it is not appropriate for an imbalanced dataset [5][14].
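A minimal sketch of Monte Carlo cross-validation using scikit-learn's ShuffleSplit follows; the number of repeats, split fraction, classifier, and data are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Ten independent random 75/25 splits; the number of repeats is free to choose
# and is unrelated to the train/validation fractions.
mc = ShuffleSplit(n_splits=10, test_size=0.25, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=mc)
print("Mean accuracy over 10 random splits:", scores.mean())
```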
(5) k-fold cross-validation
For k-fold cross-validation, the original dataset is evenly divided into k subparts or folds. In each iteration, one of the k folds or groups is chosen as the validation data, while the remaining (k-1) groups are used as the training data. The procedure is repeated k times, until each group has served as validation data and the rest as training data. The model's final accuracy is calculated as the mean accuracy of the k models on their validation data. The advantages are that the model exhibits little bias, the time complexity is low, and both training and validation use the complete dataset; the disadvantage is that it is unsuitable for an imbalanced dataset [1-7] (Hulu and Sihombing, 2020) (Darapureddy, Karatapu, and Tirumala, 2019) (Refaeilzadeh, 2008) (Arumugam, Kadhirveni, Priya, and Manimannan, 2021) (Raschka, 2020).
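A minimal k-fold sketch using scikit-learn's KFold follows; the value of k, the classifier, and the synthetic data are assumptions made only for demonstration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Five folds: each fold serves once as validation data while the other four
# train the model; the final score is the mean over the five models.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)
print("Mean accuracy over the 5 folds:", scores.mean())
```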
(6) Stratified k-fold cross-validation