DATA PREPROCESSING: A PRELIMINARY
STEP FOR WEB DATA MINING
Huma Jamshed
Sir Syed University of Engineering and Technology. University of Karachi. Karachi
(Pakistan)
E–mail: humajamshed@yahoo.com
M. Sadiq Ali Khan
Sir Syed University of Engineering and Technology. University of Karachi. Karachi
(Pakistan)
E–mail: msakhan@uok.edu.pk
Muhammad Khurram
Sir Syed University of Engineering and Technology. University of Karachi. Karachi
(Pakistan)
E–mail: muhammadkhurram@gmail.com
Syed Inayatullah
Sir Syed University of Engineering and Technology. University of Karachi. Karachi
(Pakistan)
E–mail: inayat@uok.edu.pk
Sameen Athar
Sir Syed University of Engineering and Technology. University of Karachi. Karachi
(Pakistan)
E–mail: sameenathar@yahoo.com
Received: 05/03/2019 Accepted: 12/04/2019 Published: 17/05/2019
Suggested citation:
Jamshed, H., Ali Khan, M. S., Khurram, M., Inayatullah, S. & Athar, S. (2019). Data Preprocessing: A preliminary step for web data mining. 3C Tecnología. Glosas de innovación aplicadas a la pyme. Special Issue, May 2019, pp. 206–221. doi: http://dx.doi.org/10.17993/3ctecno.2019.specialissue2.206-221
ABSTRACT
In recent years an immense growth of data, i.e. big data, has been observed, promising a brighter and more optimized future. Big Data demands a large computational infrastructure with high-performance processing capabilities. Preparing big data for mining and analysis is a challenging task and requires the data to be preprocessed to improve the quality of the raw data. The representation and quality of the data instances are foremost. Data preprocessing is a preliminary data mining practice in which raw data is transformed into a format suitable for further processing. Data preprocessing improves data quality by cleaning, normalizing, transforming and extracting relevant features from raw data. It significantly improves the performance of machine learning algorithms, which in turn leads to accurate data mining. Knowledge discovery from noisy, irrelevant and redundant data is a difficult task; therefore, the precise identification of extreme values and outliers and the filling of missing values pose challenges. This paper discusses various big data pre-processing techniques in order to prepare data for mining and analysis tasks.
KEYWORDS
Big Data, Data Pre–processing, Data mining, Data preparation, Text Pre–processing.
1. INTRODUCTION
Year after year, organizations have realized the benefits that big data analytics provides. Data scientists and researchers demand an evolution of current practices for processing raw data. Automated information extraction from huge data repositories is impractical as most of the data is unstructured. Cloud computing services, being cost-effective and easy to use, have also contributed to the growing rate of data on the web. This phenomenon undoubtedly signifies a challenge for data scientists and analysts; therefore Big Data, characterized by very high volume, velocity and variety, requires new high-performance processing (Xindong, Xingquan, Gong–Qing & Ding, 2014). The process of extracting relevant and useful information from this data deluge is known as data mining, and it is utterly dependent on the quality of the data. Raw data is usually vulnerable to noise, is incomplete or inconsistent, and contains outlier values. Thus, this data has to be processed prior to the application of data mining (Alasadi & Bhaya, 2017).
Data preprocessing involves the transformation of the raw dataset into an understandable format. Preprocessing data is a fundamental stage in data mining that improves data efficiency. The data preprocessing methods directly affect the outcomes of any analytic algorithm; however, the methods of pre-processing may vary with the area of application. Data pre-processing is a significant stage in the data mining process. According to a report by the Aberdeen Group, data preparation refers to any action intended to increase the quality, usability, accessibility, or portability of data. The ultimate objective of data preparation is to supply analytical systems with clean and consumable data that can be transformed into actionable insights. Data preprocessing embraces numerous practices such as cleaning, integration, transformation and reduction. The preprocessing phase may consume a substantial amount of time, but the outcome is a final data set that is anticipated to be correct and beneficial for further data mining algorithms.
Figure 1. Knowledge Discovery Process in Data Mining.
2. BACKGROUND
The raw data available in data warehouses, data marts and database files (Jiawei, Micheline & Jian, 2012) is generally not organized for analysis: it may be incomplete or inconsistent, it may be distributed across various tables, or it may be represented in different formats; in short, it is dirty. The process of discovering knowledge from massive chronological data sources is called Knowledge Discovery in Databases (KDD) or Data Mining (Malley, Ramazzotti & Wu, 2016; Gupta & Gurpreet, 2009). In this era of big data, every field of life is generating data at a drastic rate. The most challenging task is to gain the right information from the available data sources.
The task of reorganizing data so that the anticipated knowledge can be discovered is known as data preparation. It incorporates understanding the domain-based problem under consideration and then collecting the targeted data to achieve the anticipated goals (Gülser, İnci & Murat, 2011). Forrester estimates that up to 80 per cent of a data analyst's time is consumed in preparing data (Goetz, 2015). The selected data can then be preprocessed for data mining. Data pre-processing is the finest solution to increase data quality; it includes cleansing of data, normalization of data, transformation, feature extraction and selection, etc. The processed data becomes the training set for the machine learning algorithm.
3. DATA PRE–PROCESSING STAGES
3.1. DATA CLEANING
The rst stage of data preprocessing is Data cleaning which recognizes partial,
incorrect, imprecise or inappropriate parts of the data from datasets (Tamraparni
& Theodore, 2003). Data cleaning may eliminate typographical errors. It may
ignore tuple contains missing values or alter values compared to a known list
of entities. The data then becomes consistent with other data sets available in
the system. Precisely, data cleaning comprises the following four basic steps as
described in Table 1.
Table 1. Data Cleaning Steps.
Data Analysis: Detect dirty data by reviewing the dataset, the quality of the data and the metadata.
Define Workflow: Define the cleaning rules by considering the degree of heterogeneity among the diverse data sources, then order the rules into a workflow (which data type to clean, under what condition, with what strategy, etc.).
Execute Defined Rules: Apply the defined rules to the source dataset and display the resulting clean data to the user.
Verification: Verify the accuracy and efficiency of the cleaning rules and whether they satisfy the user requirements.
Steps 2 and 3 are executed repetitively until all problems related to data quality are solved, and steps 1 to 4 are repeated until the user requirements for clean data are met. Handling missing values is difficult, since improperly handled missing values may lead to poor extracted knowledge (Hai & Shouhong, 2009). The Expectation–Maximization (EM) algorithm, imputation and filtering are generally considered for handling missing values ("Expectation maximization algorithm"). Various data cleansing solutions apply a validated data set to the dirty data in order to clean it. Some tools use data enhancement techniques, which make an incomplete data set complete by adding related information. Binning methods can be used to remove noisy data, and clustering techniques are used to detect outliers (Jiawei, et al., 2012). Data can also be smoothed by fitting it to a regression function; numerous regression procedures such as linear, multiple or logistic regression are used to determine the regression function.
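As a minimal sketch of the simplest of these strategies, the listing below imputes missing values in a numeric column with the column mean; the records and the "age" attribute are hypothetical examples, not data from this study.

<?php
// Minimal sketch: mean imputation for a numeric attribute with missing values.
// The records and the "age" column are hypothetical.
$records = [
    ['age' => 23], ['age' => null], ['age' => 31], ['age' => 27], ['age' => null],
];

// Compute the mean over the observed (non-missing) values only.
$observed = array_filter(array_column($records, 'age'), fn($v) => $v !== null);
$mean = array_sum($observed) / count($observed);

// Replace every missing value with the column mean.
foreach ($records as &$row) {
    if ($row['age'] === null) {
        $row['age'] = round($mean, 1);
    }
}
unset($row);
print_r($records);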
3.2. DATA INTEGRATION
Data integration is the method of merging data derived from different data sources into a consistent dataset. Data on the web is expanding in size and complexity, and is either unstructured or semi-structured. Integration of data is an extremely cumbersome and iterative process. The considerations during the integration process are mostly related to the standards of the heterogeneous data sources. Furthermore, the process of integrating new data sources into the existing dataset is time-consuming, which ultimately results in inappropriate consumption of valuable information. ETL (Extract–Transform–Load) tools are used to handle larger volumes of data; they integrate diverse sources into a single physical location, provide uniform conceptual schemas and provide querying capabilities.
3.3. DATA TRANSFORMATION
Raw data usually has to be transformed into a format suitable for analysis. Data can be normalized, for instance by transforming a numerical variable to a common range. Data normalization can be achieved using the range (min-max) normalization technique or the z-score method. Categorical data can also be transformed using aggregation, which merges two or more attributes into a single attribute. Generalization can be applied to low-level attributes, which are transformed to a higher conceptual level.
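A minimal sketch of the two normalization methods mentioned above, applied to a hypothetical numeric attribute (the values are illustrative only):

<?php
// Sketch of range (min-max) and z-score normalization for one numeric attribute.
$values = [12.0, 45.5, 7.2, 60.0, 33.3];

// Min-max normalization: rescale every value into the range [0, 1].
$min = min($values);
$max = max($values);
$rangeNorm = array_map(fn($v) => ($v - $min) / ($max - $min), $values);

// Z-score normalization: subtract the mean and divide by the standard deviation.
$mean = array_sum($values) / count($values);
$var  = array_sum(array_map(fn($v) => ($v - $mean) ** 2, $values)) / count($values);
$zNorm = array_map(fn($v) => ($v - $mean) / sqrt($var), $values);

print_r($rangeNorm);
print_r($zNorm);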
3.4. DATA REDUCTION
Multifaceted exploration of huge data sources may consume considerable time or even be infeasible. When the number of predictor variables or the number of instances becomes large, mining algorithms suffer from dimensionality-handling problems (Jiawei, et al., 2012). The last stage of data preprocessing is data reduction. Data reduction makes the input data more effective in representation without losing its integrity. Data reduction may or may not be lossless; the resulting database may contain all the information of the original database in a better-organized format (Bellatreche & Chakravarthy, 2017). Encoding techniques, concept hierarchies and data cube aggregation can be used to reduce the size of the dataset. Data reduction complements the feature selection process.
Instance selection (Vijayarani, Ilamathi & Nithya, 2015) and instance generation are two approaches used by data mining algorithms to reduce the data size.
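As a hedged illustration of instance selection, the sketch below keeps a random fraction of the instances; it is a generic random-sampling example, not the selection strategy of any particular algorithm cited above.

<?php
// Minimal sketch of instance selection by simple random sampling:
// keep a fixed fraction of the instances so mining runs on less data.
function sampleInstances(array $rows, float $fraction): array
{
    $keep = max(1, (int) floor(count($rows) * $fraction));
    shuffle($rows);                      // random permutation of the instances
    return array_slice($rows, 0, $keep); // retain the first $keep instances
}

$dataset = range(1, 1000);                // stand-in for 1,000 instances
$reduced = sampleInstances($dataset, 0.1);
echo count($reduced);                     // 100 instances retained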
4. WEB DATA PREPROCESSING FRAMEWORK
The World Wide Web is a huge repository of an awful lot of textual data, most of it created on a daily basis and ranging from structured to semi-structured to completely unstructured (Andrew, 2015). How can we utilize that data in a productive way? What can we do with it? The answer to these two questions depends entirely on our objective.
Figure 2. Framework for web content Pre–processing.
To leverage the availability of all of this data, it has to be preprocessed. Preprocessing entails various steps, each of which may or may not apply to a given task, but they usually fall under the broad categories of tokenization, normalization, and substitution.
Tokenization: in textual data preprocessing, tokenization is used to split long strings of text into smaller ones; for example, sentences can be tokenized into words. It is also known as text segmentation or lexical analysis.
Normalization: this generally refers to a series of related tasks that place all words on an equal footing, for instance performing stemming or lemmatization, changing case from upper to lower or lower to upper, removing punctuation, extra spaces or stop words, substituting numbers with their equivalent words, etc.
Substitution or Noise Removal: text data on websites is wrapped in HTML or XML tags; pattern matching or regular expressions can be used to extract the desired text by removing HTML or XML markup and metadata.
5. CASE STUDY
Our objective is to preprocess a predetermined body of text so that we are left with artifacts that are more valuable and meaningful for any text mining algorithm. The approach proposed here is fully applicable to any web page content. We will remove noise, which in our case means HTML tags, and substitute English-language contractions. Then the content will be tokenized and finally we will normalize the text. We have used PHP as the scripting language to perform preprocessing on the text and explored a PHP Natural Language Toolkit for tokenization. Figure 3 shows dummy HTML page content, but the steps for preprocessing this data are fully transferable.
Figure 3. Basic HTML page content.
5.1. NOISE REMOVAL AND SUBSTITUTION
The data preprocessing pipeline starts with noise removal, as it is not task dependent. The lines of code in Figure 4 read in the text file called sample.txt, which contains the dummy HTML data shown in Figure 3, and call a PHP built-in function to strip off the HTML tags.
Figure 4. Code to strip HTML tags.
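Since the listing in Figure 4 is an image, the sketch below reproduces the described step in plain PHP; the entity-decoding and whitespace-collapsing lines are added assumptions rather than part of the original figure.

<?php
// De-noising sketch: read sample.txt and strip the HTML markup.
$raw  = file_get_contents('sample.txt');
$text = strip_tags($raw);                         // remove HTML/XML tags
$text = html_entity_decode($text);                // decode entities such as &amp;
$text = preg_replace('/\s+/', ' ', trim($text));  // collapse leftover whitespace
echo $text;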
It is beneficial to replace English-language contractions with their expansions before tokenization; otherwise the tokenizer will split a word such as "didn't" into "did" and "n't" rather than "did" and "not". We implemented contraction expansion by loading a list of contractions from a MySQL database and comparing it with our content, replacing every occurrence of a matched contraction with its expansion.
Figure 5. Substitution of contractions.
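A sketch of that substitution step is given below; the connection details and the table and column names (contractions, short_form, long_form) are hypothetical stand-ins for the database used in Figure 5.

<?php
// Contraction expansion sketch: load contraction pairs from MySQL and
// replace every match in the de-noised text ($text from the previous step).
$pdo  = new PDO('mysql:host=localhost;dbname=preprocess', 'user', 'pass');
$rows = $pdo->query('SELECT short_form, long_form FROM contractions')
            ->fetchAll(PDO::FETCH_ASSOC);

foreach ($rows as $row) {
    // Case-insensitive replacement, e.g. "didn't" -> "did not".
    $text = str_ireplace($row['short_form'], $row['long_form'], $text);
}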
Figure 6. Text after de–noising.
5.2. TOKENIZATION
For tokenization, we have used a PHP Natural Language Processing (NLP) toolkit. The toolkit supports various kinds of tokenization under its tokenizers namespace; we are using the RegexTokenizer.
Figure 7. Tokenization using PHP NLP toolkit.
Figure 8. Words Token.
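Since the toolkit call in Figure 7 is only shown as an image, the sketch below approximates its regex tokenization with plain PHP; the exact pattern is an assumption.

<?php
// Regex tokenization sketch: split the de-noised text on anything that is
// not a letter, digit or apostrophe, discarding empty pieces.
$text   = "She did not say anything about the 3 new pages.";
$tokens = preg_split("/[^A-Za-z0-9']+/", $text, -1, PREG_SPLIT_NO_EMPTY);
print_r($tokens);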
5.3. NORMALIZATION
For text normalization we perform (1) stemming and (2) everything else.
Stemming: The aim of this step is to condense inflectional forms of a word to a common base form, for instance "cars" to "car".
Figure 9. Stemming English language words.
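The stemmer used in Figure 9 is not reproduced here; the sketch below is a deliberately crude suffix-stripping stand-in that only illustrates the idea of reducing inflected forms to a base form.

<?php
// Crude suffix-stripping sketch, standing in for a proper Porter stemmer.
// Only a handful of endings are handled; a real stemmer covers far more cases.
function crudeStem(string $word): string
{
    $rules = ['sses' => 'ss', 'ies' => 'i', 'ing' => '', 'ed' => '', 's' => ''];
    foreach ($rules as $suffix => $replacement) {
        if (str_ends_with($word, $suffix) && strlen($word) > strlen($suffix) + 2) {
            return substr($word, 0, -strlen($suffix)) . $replacement;
        }
    }
    return $word;
}

echo crudeStem('cars');     // car
echo crudeStem('running');  // runn (crude: no double-consonant handling)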
Everything Else: This step transforms all words into lowercase, removes non-ASCII characters, removes punctuation, replaces numbers, and removes stop words.
Figure 10. PHP functions for text normalization.
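A plain-PHP sketch of these remaining steps is shown below; the stop-word list is a tiny illustrative subset, and numbers are replaced with a generic placeholder rather than their spelled-out words, which is a simplification of the step shown in Figure 10.

<?php
// Remaining normalization steps applied to $text (the content from earlier steps).
$stopWords = ['a', 'an', 'the', 'is', 'are', 'of', 'and', 'to', 'in'];

$text = strtolower($text);                           // lowercase everything
$text = preg_replace('/[^\x20-\x7E]/', '', $text);   // drop non-ASCII characters
$text = preg_replace('/[[:punct:]]+/', ' ', $text);  // strip punctuation
$text = preg_replace('/\d+/', 'NUM', $text);         // placeholder for numbers

$tokens = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
$tokens = array_values(array_diff($tokens, $stopWords)); // remove stop words
print_r($tokens);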
Figure 11. Final output after applying all preprocessing steps.
The results of this simple text data preprocessing process are shown in Figure 11.
6. CONCLUSION
Any data analysis algorithm will fail to discover hidden patterns or trends in data if the dataset under observation is inadequate, irrelevant or incomplete. Thus data preprocessing is a central phase in any data analysis process. The preprocessing of data resolves numerous kinds of problems such as noise, redundancy, missing values, etc. High-quality results are only achievable with high-quality data, which in turn also reduces the cost of data mining. The foundation of a decision-making system in any organization is the three C's properties of data, i.e. Completeness, Consistency and Correctness. Deprived data quality affects the decision-making process, which eventually decreases customer satisfaction. Furthermore, larger datasets affect the performance of any machine learning algorithm; therefore instance selection, which lessens the data, is an efficient approach to make machine learning algorithms work effectively.
ACKNOWLEDGEMENTS
This work was supported by the Department of Computer Science, University of Karachi. We are thankful to our colleagues from the computer science department who provided insight and expertise that significantly helped this work. The authors would like to thank the anonymous reviewers for their valuable and constructive comments on improving the paper.
REFERENCES
Alasadi, S. & Bhaya, W. (2017). Review of Data Preprocessing Techniques in Data Mining. Journal of Engineering and Applied Sciences, 12(16), pp. 4102–4107. doi: http://dx.doi.org/10.3923/jeasci.2017.4102.4107
Andrew, K. (2015). The research of text preprocessing effect on text documents classification efficiency. International Conference Stability and Control Processes, IEEE, St. Petersburg, Russia.
Bellatreche, L. & Chakravarthy, S. (2017). Big Data Analytics and Knowledge
Discovery. Proceeding of 19th International Conference DAWak Lyon France.
Expectation maximization algorithm. Wikipedia. Retrieved February 10, 2019, from https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm
Goetz, M. (2015). Three ways data preparation tools help you get ahead of Big Data. Retrieved from https://go.forrester.com/blogs/15-02-17-3_ways_data_preparation_tools_help_you_get_ahead_of_big_data/
Gülser, K., İnci, B. & Murat, C. (2011). A review of data mining applications for quality improvement in manufacturing industry. Expert Systems with Applications, 38(10), pp. 13448–13467. doi: http://dx.doi.org/10.1016/j.eswa.2011.04.063
Gupta, V. & Gurpreet, S. (2009). A Survey of Text Mining Techniques and
Applications. Journal of Emerging Technologies in Web Intelligence, 1(1), pp. 60–76.
Hai, W. & Shouhong, W. (2009). Mining incomplete survey data through classification. Knowledge and Information Systems (Springer), 24(2), pp. 221–233. doi: http://dx.doi.org/10.1007/s10115-009-0245-8
Jiawei, H., Micheline, K. & Jian, P. (2012). Data Mining Concepts and Techniques (3rd ed.). USA: Morgan Kaufmann.
Malley, B., Ramazzotti, D. & Wu, J. (2016). Data Pre-processing. In Secondary Analysis of Electronic Health Records. Springer. Retrieved from https://link.springer.com/book/10.1007/978-3-319-43742-2
Tamraparni, D. & Theodore, J. (2003). Exploratory data mining and data cleaning. New York, USA: John Wiley & Sons.
Vijayarani, S., Ilamathi, M., & Nithya, M. (2015). Preprocessing Techniques
for Text Mining – An Overview. International Journal of Computer Science &
Communication Networks, 5(1), pp. 7–16.
Xindong, W., Xingquan, Z., Gong–Qing, W. & Ding, W. (2014). Data Mining with Big Data. IEEE Transactions on Knowledge and Data Engineering, 26(1), pp. 97–107. doi: http://dx.doi.org/10.1109/TKDE.2013.109