Edición Especial Special Issue Mayo 2019
DOI: http://dx.doi.org/10.17993/3ctecno.2019.specialissue2.206-221
3. DATA PRE–PROCESSING STAGES
3.1. DATA CLEANING
The first stage of data preprocessing is data cleaning, which identifies incomplete,
incorrect, imprecise, or inappropriate parts of the data in a dataset (Tamraparni
& Theodore, 2003). Data cleaning may eliminate typographical errors. It may
ignore tuples that contain missing values, or correct values by comparing them
against a known list of entities. The data then becomes consistent with other
data sets available in the system. Specifically, data cleaning comprises the four
basic steps described in Table 1.
Table 1. Data Cleaning Steps.
| Steps | Description |
|---|---|
| Data Analysis | Detect dirty data by reviewing the dataset, the quality of the data, and the metadata. |
| Define Workflow | Define the cleaning rules by considering the degree of heterogeneity among the diverse data sources, then set the workflow order of the cleaning rules, such as cleaning a particular data type, the condition, and the strategy to apply. |
| Execute Defined Rules | Apply the defined rules to the source dataset and display the resulting clean data to the user. |
| Verification | Verify the accuracy and efficiency of the cleaning rules and whether they satisfy the user requirements. |
Steps 2–3 are executed repeatedly until all data quality problems are resolved.
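The steps in Table 1 can be illustrated with a minimal sketch in Python. It assumes a rule is a plain function applied to one record at a time, and that verification means checking a field against a known list of entities. The field names, the `KNOWN_COUNTRIES` list, and all function names are hypothetical and chosen only for illustration.

```python
# Hypothetical reference list of known entities (lower-case key -> canonical form).
KNOWN_COUNTRIES = {"pakistan": "Pakistan", "spain": "Spain"}


def strip_whitespace(record):
    """Cleaning rule: trim leading and trailing spaces from string fields."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}


def canonicalize_country(record):
    """Cleaning rule: replace a country value with its form in the known list."""
    country = record.get("country")
    if isinstance(country, str):
        canonical = KNOWN_COUNTRIES.get(country.lower())
        if canonical is not None:
            return {**record, "country": canonical}
    return record


def find_problems(records):
    """Verification: return the records whose country is not in the known list."""
    valid = set(KNOWN_COUNTRIES.values())
    return [r for r in records if r.get("country") not in valid]


def clean(records, rules):
    """Execute the rules in workflow order, then verify the result."""
    for rule in rules:
        records = [rule(r) for r in records]
    return records, find_problems(records)


dirty = [
    {"name": " Ali ", "country": " pakistan"},
    {"name": "Ana", "country": "SPAIN "},
    {"name": "Bob", "country": "Atlantis"},
]
# The order matters: whitespace is removed before the lookup in the known list.
cleaned, problems = clean(dirty, [strip_whitespace, canonicalize_country])
```

Any records returned in `problems` would prompt a revision of the rules and another run, which corresponds to repeating the workflow and execution steps.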
Steps 1–4 are repeated until the user's requirements for clean data are met. Handling
missing values is difficult, because improperly handled missing values may lead to
poor knowledge extraction (Hai & Shouhong, 2009). The Expectation–Maximization
(EM) algorithm, imputation, and filtering are generally considered for handling
missing values (“Expectation maximization algorithm”). Various data cleansing
solutions clean dirty data by checking it against a validated data set. Some tools
use data enhancement techniques, which make an incomplete data set complete
by adding related information. Binning methods can be used to remove
noisy data. Clustering techniques are used to detect outliers (Jiawei, et al., 2012).
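As a rough illustration of three of the techniques mentioned above, the following Python sketch shows mean imputation for missing values, smoothing by bin means with equal-frequency bins, and a very simple one-dimensional clustering that flags values in small, isolated clusters as outliers. The function names, the sample values, and the parameters `bin_size`, `max_gap`, and `min_cluster_size` are assumptions made for this example. The gap-based grouping is a deliberately simplified stand-in for a full clustering algorithm.

```python
from statistics import mean


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of bin_size items, and
    replace every value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed


def cluster_outliers(values, max_gap, min_cluster_size=2):
    """Group sorted values into clusters wherever neighbours differ by at
    most max_gap; values in clusters smaller than min_cluster_size are
    reported as outliers."""
    ordered = sorted(values)
    if not ordered:
        return []
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return [v for c in clusters if len(c) < min_cluster_size for v in c]


prices = [4, 8, None, 15, 21, 21, 24, 25, 28, 34]   # illustrative data only
filled = impute_mean(prices)                         # None becomes the mean
smoothed = smooth_by_bin_means(filled, bin_size=3)   # values sorted, then binned
outliers = cluster_outliers(filled + [95], max_gap=10)
```

Smoothing by bin means returns the values in sorted order, so it suits attributes where the original position of a value does not need to be preserved.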
Data can also be smoothed by fitting it to a regression function. Numerous
regression procedures, such as linear, multiple, or logistic regression, are used to
determine the regression function.
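For the simplest case, linear regression on one attribute, smoothing can be sketched as follows: fit a straight line by ordinary least squares and replace each observed value with the value predicted by the line. The function names and the sample data are hypothetical, and the sketch does not cover multiple or logistic regression.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two or more (x, y) pairs of equal length")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def smooth_with_line(xs, ys):
    """Replace each observed y with the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


xs = [1, 2, 3, 4, 5]
noisy = [2.1, 3.9, 6.2, 7.8, 10.0]   # illustrative noisy measurements
slope, intercept = fit_line(xs, noisy)
smoothed_line = smooth_with_line(xs, noisy)
```

The smoothed values lie exactly on the fitted line, so the choice of regression model determines how much of the original variation is treated as noise.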